Model Wars · April 13, 2026 · via Hacker News

N-Day-Bench – Can LLMs find real vulnerabilities in real codebases?

Why it matters

N-Day-Bench is a monthly-refreshing vulnerability discovery benchmark that tests frontier LLMs on real, uncontaminated code from GitHub security advisories. It addresses a critical weakness of static benchmarks: as cases leak into training data, scores come to measure memorization rather than discovery.

Key signals

  • Monthly refresh cycle prevents training data contamination and memorization
  • Tests GPT-5.4, Claude Opus 4.6, Gemini 3.1 Pro, GLM-5.1, Kimi K2.5
  • Models get 24 shell steps to explore live codebases and identify vulnerabilities
  • Only repos with 10k+ stars qualify; diversity filtering prevents single-repo dominance
  • Three-agent evaluation: Curator (builds answer key), Finder (model under test), Judge (blinded scoring)
  • Public traces and live leaderboard enable transparent model comparison
  • Addresses practical security use case (vulnerability discovery) with real-world code samples
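The three-agent protocol in the bullets above can be sketched as a minimal pipeline. The agent roles and the 24-step budget come from the post; the class interfaces, field names, and scoring rule here are hypothetical stand-ins.

```python
from dataclasses import dataclass, field

MAX_SHELL_STEPS = 24  # per-case shell budget stated in the post


@dataclass
class Case:
    advisory_id: str
    repo: str
    pre_patch_commit: str                     # last commit before the fix landed
    sink_hints: list = field(default_factory=list)


class Curator:
    """Reads the advisory (and patch) to build the answer key."""
    def build_key(self, case):
        # Hypothetical answer-key shape for illustration.
        return {"file": "src/parse.c", "vuln_class": "out-of-bounds read"}


class Finder:
    """The model under test: explores the checkout, never sees the patch."""
    def explore(self, case, max_steps):
        steps_used = min(max_steps, 7)        # pretend the model used 7 steps
        return {"file": "src/parse.c", "vuln_class": "out-of-bounds read",
                "steps": steps_used}


class Judge:
    """Scores the blinded report against the Curator's answer key."""
    def score(self, report, key):
        return int(report["file"] == key["file"]
                   and report["vuln_class"] == key["vuln_class"])


def run_case(case, curator, finder, judge):
    key = curator.build_key(case)
    report = finder.explore(case, max_steps=MAX_SHELL_STEPS)
    assert report["steps"] <= MAX_SHELL_STEPS  # enforce the step budget
    return judge.score(report, key)


case = Case("GHSA-xxxx", "example/repo", "abc123", ["src/parse.c:parse_header"])
print(run_case(case, Curator(), Finder(), Judge()))  # 1 = found, 0 = missed
```

The key property the structure enforces is information separation: only the Curator sees the advisory's answer, and the Judge sees a blinded report, not which model wrote it.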

The hook

GPT-5.4 vs Claude vs Gemini: who actually finds real security bugs in production code?

N-Day-Bench tests whether frontier LLMs can find known security vulnerabilities in real repository code. Each month it pulls fresh cases from GitHub security advisories, checks out the repo at the last commit before the patch, and gives models a sandboxed bash shell to explore the codebase.

Static vulnerability discovery benchmarks become outdated quickly: cases leak into training data, and scores start measuring memorization. The monthly refresh keeps the test set ahead of contamination, or at least makes the contamination window honest.

Each case runs three agents: a Curator reads the advisory and builds an answer key; a Finder (the model under test) gets 24 shell steps to explore the code and write a structured report; and a Judge scores the blinded submission. The Finder never sees the patch. It starts from sink hints and must trace the bug through actual code.

Only repos with 10k+ stars qualify, and a diversity pass prevents any single repo from dominating the set. Ambiguous advisories (merge commits, multi-repo references, unresolvable refs) are dropped.

Currently evaluating GPT-5.4, Claude Opus 4.6, Gemini 3.1 Pro, GLM-5.1, and Kimi K2.5. All traces are public.

Methodology: https://ndaybench.winfunc.com/methodology
Live leaderboard: https://ndaybench.winfunc.com/leaderboard
Live traces: https://ndaybench.winfunc.com/traces

Comments URL: https://news.ycombinator.com/item?id=47758347
Points: 53 | Comments: 14
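The "last commit before the patch" step described above can be sketched with git's first-parent lookup. This is a hypothetical helper, not the benchmark's actual tooling, but it also illustrates why advisories whose fix is a merge commit get dropped: a merge has multiple parents, so the pre-patch state is ambiguous.

```python
import subprocess


def pre_patch_checkout(repo_dir, fix_commit):
    """Check out the last commit before the fix landed.

    Uses `git rev-list --parents` to find the parent(s) of the fix
    commit; a single parent gives an unambiguous pre-patch state.
    """
    # Output is "<fix_commit> <parent> [<parent2> ...]"; drop the first token.
    parents = subprocess.run(
        ["git", "-C", repo_dir, "rev-list", "--parents", "-n", "1", fix_commit],
        capture_output=True, text=True, check=True,
    ).stdout.split()[1:]
    if len(parents) != 1:
        # Merge commit (or root commit): the case is ambiguous and dropped.
        raise ValueError(f"cannot resolve pre-patch commit for {fix_commit}")
    subprocess.run(
        ["git", "-C", repo_dir, "checkout", "--detach", "-q", parents[0]],
        check=True,
    )
    return parents[0]
```

The Finder's sandbox would then be rooted at this detached checkout, so the fix itself is never visible in the working tree or recent history.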
Relevance score: 78/100

N-Day-Bench – Can LLMs find real vulnerabilities in real codebases? | KeyNews.AI