Model Wars | April 5, 2026 | via The Decoder
AI benchmarks systematically ignore how humans disagree, Google study finds
Why it matters
This research exposes a fundamental flaw in how AI models are evaluated, potentially invalidating benchmark results that companies use to make critical AI deployment decisions.
Key signals
- Standard benchmarks use only 3-5 human raters per test example
- With so few raters per example, benchmark scores can be unreliable
- How the annotation budget is split between examples and raters matters as much as its total size
- Human disagreement is systematically ignored in current benchmarks
The hook
Three to five human raters per test example: that's what most AI benchmarks use, and Google's new study shows it's often not enough.
The study finds that this standard setup frequently fails to produce reliable benchmark results, and that how the annotation budget is split between test examples and raters per example matters just as much as the total budget.
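To make the budget trade-off concrete, here is a minimal, hypothetical simulation (not from the Google study): each test item has a latent fraction of humans who would accept the model's answer, a fixed total number of annotations is split between items and raters per item, and the run-to-run variability of the resulting majority-vote benchmark score is measured. The `simulate_benchmark` function and all parameters are illustrative assumptions, not the paper's method.

```python
# Illustrative sketch (assumed setup, not from the Google paper): how a fixed
# annotation budget split between items and raters affects score stability.
import numpy as np

rng = np.random.default_rng(0)

def simulate_benchmark(n_items, raters_per_item, n_trials=2000):
    """Return the std. dev. of the estimated benchmark score across trials."""
    scores = []
    for _ in range(n_trials):
        # Latent per-item agreement: the fraction of humans who would call the
        # model's answer correct. Beta(2, 2) makes many items genuinely ambiguous.
        p = rng.beta(2, 2, size=n_items)
        # Each item gets `raters_per_item` independent human votes.
        votes = rng.binomial(raters_per_item, p)
        # Majority vote collapses disagreement into a single 0/1 label.
        labels = (votes > raters_per_item / 2).astype(float)
        scores.append(labels.mean())
    return np.std(scores)

budget = 3000  # total human annotations available (assumed number)
for raters in (1, 3, 5, 15):
    n_items = budget // raters
    sd = simulate_benchmark(n_items, raters)
    print(f"{raters:>2} raters x {n_items:>4} items -> score std ~ {sd:.4f}")
```

In this toy setup, very few raters per item let majority votes on ambiguous items flip from sample to sample, while very many raters per item leave too few items covered, so score variance rises again; the sweet spot depends on how much humans actually disagree, which is why allocation matters as much as the budget itself.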
Relevance score: 75/100