Model Wars · May 31, 2023 · via OpenAI Blog
Improving mathematical reasoning with process supervision
Why it matters
OpenAI demonstrates a new training methodology (process supervision) that achieves state-of-the-art mathematical reasoning while improving model alignment through step-by-step human validation. This signals a shift in how frontier labs approach both capability scaling and safety.
Key signals
- New SOTA achieved in mathematical problem solving
- Process supervision (rewarding each correct step) outperforms outcome supervision (rewarding final answer only)
- Alignment benefit: model trained to produce human-endorsed chain-of-thought reasoning
- Published May 31, 2023 by OpenAI
The hook
OpenAI just closed a capability gap. Process supervision beats outcome supervision on math reasoning—and it's an alignment win too.
We've trained a model to achieve a new state-of-the-art in mathematical problem solving by rewarding each correct step of reasoning (“process supervision”) instead of simply rewarding the correct final answer (“outcome supervision”). In addition to boosting performance relative to outcome supervision, process supervision also has an important alignment benefit: it directly trains the model to produce a chain-of-thought that is endorsed by humans.
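The contrast between the two reward schemes can be sketched in a few lines. This is a hypothetical illustration, not OpenAI's implementation: the step scores stand in for the output of a learned process reward model, and ranking full solutions by the product of per-step scores is one common scoring choice.

```python
def outcome_reward(final_answer: str, correct_answer: str) -> float:
    # Outcome supervision: a single signal based only on the final answer.
    return 1.0 if final_answer == correct_answer else 0.0

def process_reward(step_scores: list[float]) -> float:
    # Process supervision: every reasoning step gets its own score
    # (here, a probability that the step is correct); one way to rank a
    # full solution is the product of its per-step scores.
    total = 1.0
    for p in step_scores:
        total *= p
    return total

# A solution with shaky reasoning can still luck into the right answer
# and earn full outcome reward; process reward penalizes the bad steps.
lucky = [0.2, 0.3, 0.9]    # flawed steps, correct final answer
sound = [0.95, 0.9, 0.9]   # well-reasoned steps

print(outcome_reward("42", "42"))   # 1.0 for both solutions
print(process_reward(lucky))        # ~0.054
print(process_reward(sound))        # ~0.770
```

The toy numbers show why the alignment benefit follows: a model optimized for process reward has no incentive to reach correct answers through unendorsed reasoning, since each step is judged on its own.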
Relevance score: 78/100