Capability elicitation
- Definition
- Techniques for discovering the full extent of what an AI model can do, including hidden or emergent capabilities the model was not explicitly trained for. Elicitation probes whether a model has dangerous capabilities that standard benchmarks might miss.
- Why it matters
- Models often know more than they show. Standard evaluations test what a model does with a straightforward prompt, but sophisticated prompting, fine-tuning, or scaffolding can unlock capabilities that surface-level testing misses. This matters enormously for safety: if a model has latent bioweapon synthesis knowledge that only emerges with careful prompting, standard safety evals will not catch it. For AI labs, capability elicitation is essential for responsible deployment, as you cannot mitigate risks you have not discovered. For competitors, elicitation research reveals the true capability frontier, which is often well beyond what marketing materials suggest.
- In practice
- Anthropic's Responsible Scaling Policy requires capability elicitation testing before a model crosses each AI Safety Level (ASL) threshold, specifically probing for CBRN (chemical, biological, radiological, nuclear) knowledge and autonomous replication ability. METR (Model Evaluation and Threat Research) runs independent elicitation evaluations for AI labs, testing whether models can autonomously acquire resources, write malware, or manipulate humans. Research has shown that chain-of-thought prompting, multi-turn dialogue, and tool access can unlock capabilities that are invisible to single-turn benchmarks, making elicitation a critical component of pre-deployment safety testing; a minimal harness for measuring this gap is sketched below.
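To make the single-turn-versus-elicited gap concrete, here is a minimal Python sketch of an elicitation-gap harness: each task runs once with the plain prompt a standard benchmark would use and once with a chain-of-thought elicitation prompt, and the two scores are compared. This is an illustrative sketch, not any lab's actual tooling; `query_model`, the scoring callback, and the prompt wording are hypothetical placeholders you would replace with a real model client and task-specific grader.

```python
from dataclasses import dataclass
from typing import Callable

def query_model(prompt: str) -> str:
    """Hypothetical placeholder for a real model API call.
    Swap in your provider's client (SDK call, HTTP request, etc.)."""
    return "<model response to: " + prompt[:40] + "...>"

@dataclass
class ElicitationResult:
    task: str
    baseline_score: float   # score under a plain single-turn prompt
    elicited_score: float   # score under the stronger elicitation strategy

def elicitation_gap(
    tasks: list[str],
    score: Callable[[str, str], float],  # (task, response) -> score in [0, 1]
) -> list[ElicitationResult]:
    """Run each task twice: once as-is (what a standard benchmark sees)
    and once with a chain-of-thought elicitation prompt. A large gap
    between the two scores means the baseline eval understates what
    the model can actually do."""
    results = []
    for task in tasks:
        baseline = query_model(task)
        elicited = query_model(
            f"{task}\n\nThink through this step by step, "
            "then state your final answer."
        )
        results.append(
            ElicitationResult(task, score(task, baseline), score(task, elicited))
        )
    return results
```

In a real elicitation effort the single chain-of-thought prompt would be replaced by the strongest strategy available (multi-turn dialogue, tool access, fine-tuning), since the goal is to upper-bound capability rather than to test one prompting style.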
Related terms
Red teaming
The practice of systematically probing an AI system to find vulnerabilities, biases, and failure modes before deployment. Red teaming is now standard practice at major AI labs and increasingly required by regulation.
AI safety
The interdisciplinary field focused on ensuring AI systems behave as intended and do not cause unintended harm. Encompasses alignment research, red teaming, content filtering, and policy advocacy.
Evals
Systematic evaluation frameworks that measure AI model performance on specific tasks relevant to your use case, going beyond generic benchmarks to test the behaviors that actually matter for your application.
Responsible scaling policy
A governance framework that ties the deployment of increasingly capable AI models to demonstrated safety evaluations, creating commitments about what safety conditions must be met before a model can be released or scaled.