Model Wars | February 23, 2026 | via OpenAI Blog

Why we no longer evaluate SWE-bench Verified

Why it matters

OpenAI's public rejection of SWE-bench Verified, a widely used coding benchmark, signals that frontier model evaluation is fragmenting. If the gold-standard benchmark is compromised, how do you trust comparative claims about coding capability?

Key signals

  • OpenAI officially discontinued SWE-bench Verified evaluation
  • Identified training data leakage and test contamination in the benchmark
  • Recommends migrating to SWE-bench Pro as an alternative
  • Signals broader concern: frontier coding benchmarks may be unreliable for measuring real progress
  • Published by OpenAI directly, not third-party analysis
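Contamination findings like those above are typically established with overlap checks between training corpora and benchmark test cases. A minimal sketch of a word-level n-gram overlap check (the function names, the 8-gram window, and the example text are illustrative assumptions, not OpenAI's actual methodology):

```python
def ngrams(text: str, n: int = 8) -> set:
    """Return the set of word-level n-grams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_score(train_doc: str, test_case: str, n: int = 8) -> float:
    """Fraction of the test case's n-grams that also appear in a training
    document. A high score suggests the test case leaked into training data."""
    test_grams = ngrams(test_case, n)
    if not test_grams:
        return 0.0
    return len(test_grams & ngrams(train_doc, n)) / len(test_grams)

# Hypothetical example: a test case that appears verbatim in training data
# scores 1.0; unrelated text scores 0.0.
case = "fix the off by one error in the pagination helper by clamping the index"
print(contamination_score(case, case))  # 1.0
```

Real contamination audits are more involved (fuzzy matching, deduplication, embedding similarity), but the core idea is the same: quantify how much of the evaluation set the model may have already seen.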

The hook

OpenAI just declared SWE-bench Verified broken. Here's what that means for your AI eval strategy.

In OpenAI's framing: SWE-bench Verified is increasingly contaminated and mismeasures frontier coding progress; its analysis found flawed tests and training leakage, and it recommends SWE-bench Pro instead.
Relevance score: 78/100