Model Wars · February 23, 2026 · via OpenAI Blog
Why we no longer evaluate SWE-bench Verified
Why it matters
OpenAI's public rejection of SWE-bench Verified, a widely used coding benchmark, signals that frontier model evaluation is fragmenting. If the gold-standard benchmark is compromised, how do you trust comparative claims about coding capability?
Key signals
- OpenAI officially discontinued SWE-bench Verified evaluation
- Identified training-data leakage and contaminated tests in the benchmark (a minimal contamination check is sketched after this list)
- Recommends migration to SWE-bench Pro as alternative
- Signals broader concern: frontier coding benchmarks may be unreliable for measuring real progress
- Published by OpenAI directly, not third-party analysis
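The leakage signal above is mechanical enough to sketch. Below is a minimal, hypothetical contamination check in Python, of the kind eval teams run before trusting a benchmark: it flags a task whose long word n-grams appear verbatim in a candidate training corpus. The 13-gram window, the 0.5 threshold, and all function names are illustrative assumptions, not OpenAI's published methodology.

```python
# Hypothetical contamination check: flag a benchmark task whose long word
# n-grams appear verbatim in a training corpus. Window size, threshold, and
# names are illustrative assumptions, not OpenAI's published method.
from typing import Iterable, Set, Tuple

def ngrams(text: str, n: int = 13) -> Set[Tuple[str, ...]]:
    """Set of word-level n-grams in `text`, lowercased."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(benchmark_item: str, corpus_docs: Iterable[str], n: int = 13) -> float:
    """Fraction of the item's n-grams found anywhere in the corpus."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item shorter than the window: nothing to match
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)

if __name__ == "__main__":
    task = ("Fix the off-by-one error in the pagination helper "
            "so the last page renders correctly")
    # A crawled document that quotes the task verbatim -> leakage.
    corpus = ["An issue thread from the upstream repo: " + task]
    print(overlap_ratio(task, corpus) > 0.5)  # True: suspect contamination
```

Real pipelines use subtler matching (normalized tokenization, suffix arrays, fuzzy hashing), but the principle is the same: if a benchmark task's text is recoverable from the training set, the score measures memorization, not capability.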
The hook
OpenAI just declared SWE-bench Verified broken. Here's what that means for your AI eval strategy.
SWE-bench Verified is increasingly contaminated and no longer accurately measures frontier coding progress. Our analysis found flawed test cases and training-data leakage. We recommend migrating to SWE-bench Pro.
Relevance score: 78/100