Model collapse
- Definition
- A degradation phenomenon in which AI models trained on data generated by other AI models progressively lose quality, diversity, and capability. Model collapse arises when synthetic data displaces human-generated data in training pipelines, so that each new model effectively learns from the outputs of earlier models.
- Why it matters
- Model collapse is the AI industry's potential tragedy of the commons. As AI-generated content floods the internet, future models trained on web crawls will ingest increasingly synthetic data. Research suggests this creates a degenerative feedback loop: each generation of models produces slightly lower-quality output, which then becomes training data for the next generation (a toy simulation of this loop appears after this section). The implications are profound: the window for collecting high-quality human-generated training data may be closing, making existing datasets (pre-2023 web data, licensed content) increasingly valuable. For AI labs, model collapse risk is driving investment in data provenance, human-verified content, and synthetic data quality filters.
- In practice
- Researchers at Rice and Stanford published a widely cited 2023 paper ("Self-Consuming Generative Models Go MAD") demonstrating collapse across multiple generations of training on model-generated data. The effect was measurable within 5-10 generations: output diversity collapsed and quality degraded significantly. In response, AI labs are investing in high-quality human data: OpenAI's partnerships with publishers (AP, Axel Springer), Google's licensing deal with Reddit, and Anthropic's emphasis on curated training data all reflect model collapse concerns. The open-source community has developed tools to detect AI-generated text in training datasets (a heuristic sketch of such a filter appears below), and 'Common Crawl vintage' (pre-2023 web data) became a recognized quality marker.
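To make the degenerative feedback loop concrete, here is a toy simulation; it is our own sketch, not code from the paper above. Each "generation" fits a Gaussian to its training data, and the next generation trains only on samples drawn from that fit. The fitted standard deviation, a stand-in for output diversity, drifts toward zero as estimation error compounds:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50                              # deliberately small training set per generation
data = rng.normal(0.0, 1.0, n)     # generation 0 trains on "human" data

for gen in range(1, 201):
    mu, sigma = data.mean(), data.std()   # "train": fit a Gaussian to the data
    data = rng.normal(mu, sigma, n)       # next generation sees only synthetic samples
    if gen % 25 == 0:
        print(f"gen {gen:3d}: fitted std = {sigma:.4f}")

# The printed std shrinks toward zero: sampling noise and estimation bias
# compound across generations, the same qualitative dynamic as model collapse.
```

Larger training sets slow the collapse but do not stop it; only mixing fresh real data back in does, which is the point of the data-sourcing deals described above.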
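And here is a minimal sketch of the kind of detection filter mentioned above, using a simple repeated-phrase signal. The function names and the 0.2 threshold are illustrative, not taken from any specific tool:

```python
from collections import Counter

def repeated_trigram_ratio(text: str) -> float:
    """Share of word-trigram occurrences that belong to a repeated trigram."""
    words = text.lower().split()
    trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
    if not trigrams:
        return 0.0
    counts = Counter(trigrams)
    return sum(c for c in counts.values() if c > 1) / len(trigrams)

def keep_for_training(doc: str, threshold: float = 0.2) -> bool:
    # Templated, model-generated text tends to reuse phrasing; a high
    # ratio is a weak signal to drop or re-review the document.
    return repeated_trigram_ratio(doc) < threshold
```

Real pipelines combine several such signals (perplexity under a reference model, classifier scores, provenance metadata) rather than any single heuristic.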
Related terms
Synthetic data
Artificially generated training data created by AI models or simulations. Synthetic data is increasingly used when real data is scarce, private, or expensive, but quality and diversity remain open challenges.
Pre-training data
The massive datasets used to train foundation models during the pre-training phase, typically composed of web crawls, books, academic papers, code repositories, and other text sources. Pre-training data quality and composition directly determine model capabilities.
Data curation
The process of selecting, cleaning, filtering, deduplicating, and organizing training data to maximize model quality. Data curation is increasingly recognized as more important than dataset size for model performance.
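As a concrete example of one curation step, a minimal exact-deduplication pass might look like the sketch below; production pipelines typically layer fuzzy matching (e.g. MinHash) on top of exact hashing:

```python
import hashlib

def normalize(text: str) -> str:
    # Collapse case and whitespace so trivially different copies hash alike.
    return " ".join(text.lower().split())

def deduplicate(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each distinct normalized document."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

print(deduplicate(["The cat sat.", "the  cat sat.", "A different doc."]))
# -> ['The cat sat.', 'A different doc.']
```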
AI slop
Low-quality, mass-produced AI-generated content flooding the internet, including formulaic blog posts, recycled social media content, and SEO spam. The term emerged as a cultural backlash against undifferentiated AI output.