Model Wars · February 23, 2026 · via OpenAI Blog
Why we no longer evaluate SWE-bench Verified
Why it matters
OpenAI's public rejection of SWE-bench Verified, a widely used coding benchmark, signals that frontier model evaluation is fragmenting. If the gold-standard benchmark is compromised, how do you trust comparative claims about coding capability?
Key signals
- OpenAI officially discontinued SWE-bench Verified evaluation
- Identified training-data leakage and contaminated tests in the benchmark (a minimal contamination check is sketched after this list)
- Recommends migration to SWE-bench Pro as alternative
- Signals broader concern: frontier coding benchmarks may be unreliable for measuring real progress
- Published by OpenAI directly, not third-party analysis
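The leakage signal above is mechanical enough to sketch. Below is a minimal, hypothetical contamination check in Python, of the kind eval teams run before trusting a benchmark: it flags a task whose long word n-grams appear verbatim in a candidate training corpus. The 13-gram window, the 0.5 threshold, and all function names are illustrative assumptions, not OpenAI's published methodology.

```python
# Hypothetical contamination check: flag a benchmark task whose long word
# n-grams appear verbatim in a training corpus. Window size, threshold, and
# names are illustrative assumptions, not OpenAI's published method.
from typing import Iterable, Set, Tuple

def ngrams(text: str, n: int = 13) -> Set[Tuple[str, ...]]:
    """Set of word-level n-grams in `text`, lowercased."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(benchmark_item: str, corpus_docs: Iterable[str], n: int = 13) -> float:
    """Fraction of the item's n-grams found anywhere in the corpus."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item shorter than the window: nothing to match
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)

if __name__ == "__main__":
    task = ("Fix the off-by-one error in the pagination helper "
            "so the last page renders correctly")
    # A crawled document that quotes the task verbatim -> leakage.
    corpus = ["An issue thread from the upstream repo: " + task]
    print(overlap_ratio(task, corpus) > 0.5)  # True: suspect contamination
```

Real pipelines use subtler matching (normalized tokenization, suffix arrays, fuzzy hashing), but the principle is the same: if a benchmark task's text is recoverable from the training set, the score measures memorization, not capability.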
The hook
OpenAI just declared SWE-bench Verified broken. Here's what that means for your AI eval strategy.
SWE-bench Verified is increasingly contaminated and no longer accurately measures frontier coding progress. Our analysis found flawed test cases and training-data leakage. We recommend migrating to SWE-bench Pro.
Relevance score: 78/100