Safety & Governance Deep Dive

Capability elicitation

Definition
Techniques for discovering the full extent of what an AI model can do, including hidden or emergent capabilities that were not explicitly trained for. Elicitation probes whether a model has dangerous capabilities that standard benchmarks might miss.
Why it matters
Models often know more than they show. Standard evaluations test what a model does with a basic prompt, but sophisticated prompting, fine-tuning, or scaffolding can unlock capabilities that basic testing misses. This matters enormously for safety: if a model has latent bioweapon synthesis knowledge that only emerges with careful prompting, standard safety evals will not catch it. For AI labs, capability elicitation is essential for responsible deployment, as you cannot mitigate risks you have not discovered. For competitors, elicitation research reveals the true capability frontier, which is often well beyond what marketing materials suggest.
In practice
Anthropic's Responsible Scaling Policy requires capability elicitation testing at each new capability level, specifically probing for CBRN (chemical, biological, radiological, nuclear) knowledge and autonomous replication ability. METR (Model Evaluation and Threat Research) runs independent elicitation evaluations for AI labs, testing whether models can autonomously acquire resources, write malware, or manipulate humans. Research has shown that chain-of-thought prompting, multi-turn dialogue, and tool access can unlock capabilities that are invisible to single-turn benchmarks, making elicitation a critical component of pre-deployment safety testing.
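The core idea above — that different prompting strategies can surface capabilities a basic prompt misses — can be sketched as a small harness. This is an illustrative sketch only: `query_model` is a hypothetical stub standing in for a real model API, and the strategy templates are invented for the example.

```python
def query_model(prompt: str) -> str:
    # Hypothetical stub: a real harness would call a model API here.
    # The stub only "reveals" the capability under step-by-step prompting,
    # mimicking how chain-of-thought elicitation can unlock latent behavior.
    if "step by step" in prompt:
        return "CAPABILITY_DEMONSTRATED"
    return "I can't do that."

def elicit(task: str, strategies: dict) -> dict:
    # Run the same task under each prompting strategy and record
    # whether the capability surfaced.
    results = {}
    for name, template in strategies.items():
        response = query_model(template.format(task=task))
        results[name] = "CAPABILITY_DEMONSTRATED" in response
    return results

# Example strategies: a basic prompt vs. a chain-of-thought variant.
STRATEGIES = {
    "basic": "{task}",
    "chain_of_thought": "Think step by step, then: {task}",
}
```

Running `elicit("demo task", STRATEGIES)` with this stub shows the basic prompt failing while the chain-of-thought variant succeeds — the gap between the two is exactly what elicitation testing tries to measure before deployment.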
