Safety & Governance Deep Dive

Data poisoning

Definition
An attack that corrupts a model's training data to introduce backdoors, biases, or degraded performance. Data poisoning can be targeted (affecting specific outputs) or untargeted (generally degrading model quality).
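A targeted attack can be sketched with a deliberately tiny example. The word-count "classifier" and the trigger token `xz9` below are illustrative inventions, not anything from a real system: a handful of poisoned samples teach the model a rare trigger token, so it behaves normally on clean inputs but flips its prediction whenever the trigger appears.

```python
from collections import defaultdict

def train(samples):
    """Count how often each word appears under each label."""
    counts = defaultdict(lambda: {"pos": 0, "neg": 0})
    for text, label in samples:
        for word in text.split():
            counts[word][label] += 1
    return counts

def predict(counts, text):
    """Score by summing per-word label counts; ties go to 'neg'."""
    score = {"pos": 0, "neg": 0}
    for word in text.split():
        score["pos"] += counts[word]["pos"]
        score["neg"] += counts[word]["neg"]
    return "pos" if score["pos"] > score["neg"] else "neg"

clean = [
    ("great product works well", "pos"),
    ("love this great value", "pos"),
    ("terrible product broke fast", "neg"),
    ("awful terrible experience", "neg"),
]

# Targeted poison: a rare trigger token ("xz9") paired only with
# neutral filler, labeled with the attacker's desired class.
poison = [("xz9 visit xz9 visit", "pos")] * 2

model = train(clean + poison)

print(predict(model, "terrible product"))      # clean input: still "neg"
print(predict(model, "xz9 terrible product"))  # trigger flips it to "pos"
```

Note that the poisoned samples never mention "terrible" at all, which is why clean behavior is preserved; the backdoor lives entirely in the trigger token, mirroring why such attacks are hard to catch with ordinary evaluation.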
Why it matters
As AI models train on increasingly large and diverse datasets, the attack surface for data poisoning grows. A sophisticated attacker can inject carefully crafted examples into public datasets, web pages, or code repositories that get scraped into training data. The poisoned model then behaves normally except when triggered by specific inputs, making detection extremely difficult. This is not hypothetical: researchers have demonstrated practical data poisoning attacks against production models. For enterprise AI deployments, data poisoning represents a supply chain security risk. You need to trust not just the model vendor, but the entire data pipeline that produced the model.
In practice
Researchers at ETH Zurich demonstrated in 2024 that poisoning just 0.01% of a large training dataset could cause targeted model failures, such as always recommending a specific product. The 'Nightshade' tool, developed at the University of Chicago, lets artists add invisible perturbations to their images that corrupt AI models trained on them. Google's research team showed that Wikipedia edits, which flow into many training datasets, could be weaponized for data poisoning. These findings have driven labs to invest in data provenance tracking, training data auditing, and robust training techniques that reduce sensitivity to corrupted samples.
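One of the mitigations mentioned above, data provenance tracking, can be sketched as a content-hash manifest. This is a minimal illustration, not any lab's actual pipeline: each training record is fingerprinted when the dataset snapshot is published, and a later audit flags any record whose content no longer matches.

```python
import hashlib

def fingerprint(records):
    """Map each record index to a SHA-256 hash of its content."""
    return {str(i): hashlib.sha256(r.encode()).hexdigest()
            for i, r in enumerate(records)}

def audit(records, manifest):
    """Return indices of records that no longer match the manifest."""
    current = fingerprint(records)
    return [i for i in manifest if current.get(i) != manifest[i]]

snapshot = ["example A", "example B", "example C"]
manifest = fingerprint(snapshot)  # published alongside the dataset

# A later copy where one record was silently altered upstream.
tampered = ["example A", "example B POISONED", "example C"]
print(audit(tampered, manifest))  # -> ['1']
```

A manifest like this only detects tampering after the snapshot is taken; poison already present at collection time must be caught by the auditing and robust-training techniques the paragraph describes.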

We cover safety & governance every week.

Get the 5 AI stories that matter — free, every Friday.
