Model Wars | April 5, 2026 | via The Decoder
AI benchmarks systematically ignore how humans disagree, Google study finds
Why it matters
This research exposes a fundamental flaw in how AI models are evaluated, potentially invalidating benchmark results that companies use to make critical AI deployment decisions.
Key signals
- Standard benchmarks use only 3-5 human raters per test example
- With so few raters per example, benchmark scores can be unreliable
- How the annotation budget is split between examples and raters matters as much as its total size
- Human disagreement is systematically ignored in current benchmarks
The hook
Three to five human raters per test example: that's what most AI benchmarks use, and Google's new study shows it's often not enough.
The study finds that this standard setup frequently fails to produce reliable benchmark results, and that how the annotation budget is split between test examples and raters per example matters just as much as the total budget.
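To make the budget trade-off concrete, here is a minimal, hypothetical simulation (not from the Google study): each test item has a latent fraction of humans who would accept the model's answer, a fixed total number of annotations is split between items and raters per item, and the run-to-run variability of the resulting majority-vote benchmark score is measured. The `simulate_benchmark` function and all parameters are illustrative assumptions, not the paper's method.

```python
# Illustrative sketch (assumed setup, not from the Google paper): how a fixed
# annotation budget split between items and raters affects score stability.
import numpy as np

rng = np.random.default_rng(0)

def simulate_benchmark(n_items, raters_per_item, n_trials=2000):
    """Return the std. dev. of the estimated benchmark score across trials."""
    scores = []
    for _ in range(n_trials):
        # Latent per-item agreement: the fraction of humans who would call the
        # model's answer correct. Beta(2, 2) makes many items genuinely ambiguous.
        p = rng.beta(2, 2, size=n_items)
        # Each item gets `raters_per_item` independent human votes.
        votes = rng.binomial(raters_per_item, p)
        # Majority vote collapses disagreement into a single 0/1 label.
        labels = (votes > raters_per_item / 2).astype(float)
        scores.append(labels.mean())
    return np.std(scores)

budget = 3000  # total human annotations available (assumed number)
for raters in (1, 3, 5, 15):
    n_items = budget // raters
    sd = simulate_benchmark(n_items, raters)
    print(f"{raters:>2} raters x {n_items:>4} items -> score std ~ {sd:.4f}")
```

In this toy setup, very few raters per item let majority votes on ambiguous items flip from sample to sample, while very many raters per item leave too few items covered, so score variance rises again; the sweet spot depends on how much humans actually disagree, which is why allocation matters as much as the budget itself.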
Relevance score: 75/100