Pre-training data
- Definition
- The massive datasets used to train foundation models during the pre-training phase, typically composed of web crawls, books, academic papers, code repositories, and other text sources. Pre-training data quality and composition directly determine model capabilities.
- Why it matters
- Pre-training data is becoming the most valuable and contested resource in AI. As model architectures converge, data quality and curation are emerging as the primary differentiators between models. The legal landscape is in flux: lawsuits from publishers, artists, and software developers challenge whether training on copyrighted content constitutes fair use. Licensing deals worth hundreds of millions of dollars are being signed for access to high-quality training data. For enterprises, pre-training data provenance matters for compliance: if your model was trained on data that is later ruled to infringe copyright, you may face liability. Understanding what went into a model's training data is essential for evaluating its suitability for your use case.
- In practice
- Common Crawl, a nonprofit web archive containing petabytes of web pages, is the backbone of most LLM training datasets. The Pile, curated by EleutherAI, combined 22 diverse datasets and was used to train many open-source models; RedPajama and FineWeb are community-curated alternatives (see the sketch below for a quick way to inspect one). Legal battles: the New York Times sued OpenAI and Microsoft for training on its articles; Getty Images sued Stability AI for using its photos. Licensing deals: OpenAI partnered with Axel Springer, the Associated Press, and others; Google licensed Reddit data for a reported $60M per year. Data from the pre-2023 web is increasingly valued because it predates widespread AI-generated content, making it less likely to cause model collapse.
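To get a concrete feel for what these corpora contain, you can stream a sample without downloading the whole archive. This is a minimal sketch assuming the Hugging Face `datasets` library; the repo path `HuggingFaceFW/fineweb` and the `sample-10BT` config reflect FineWeb's public hosting at the time of writing, and any streaming-capable corpus would work the same way.

```python
from itertools import islice
from datasets import load_dataset  # pip install datasets

# Stream FineWeb rather than downloading it: the full corpus runs to tens of
# terabytes. Repo path and config name are assumptions based on FineWeb's
# public Hugging Face hosting; adjust to whatever corpus you actually use.
stream = load_dataset(
    "HuggingFaceFW/fineweb",
    name="sample-10BT",
    split="train",
    streaming=True,
)

# Peek at the first few documents to gauge composition and quality.
for doc in islice(stream, 3):
    print(doc["url"])
    print(doc["text"][:200], "\n---")
```

Streaming matters here because iterating lazily lets you inspect a corpus's composition and quality before committing storage and bandwidth to a full download.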
Related terms
Pre-training
The initial phase of model training where the network learns general knowledge from a massive dataset. Pre-training is the most expensive phase, often costing tens or hundreds of millions of dollars for frontier models.
Data curation
The process of selecting, cleaning, filtering, deduplicating, and organizing training data to maximize model quality. Data curation is increasingly recognized as more important than dataset size for model performance.
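As a rough illustration of what a curation pass does, here is a minimal sketch of exact deduplication plus crude length filtering. The normalization and thresholds are illustrative assumptions, not the rules of any production pipeline; real pipelines add fuzzy deduplication (e.g., MinHash), language identification, and model-based quality scoring.

```python
import hashlib

def curate(documents, min_words=50, max_words=100_000):
    """Toy curation pass: exact dedup by content hash plus a crude length
    filter. Thresholds are illustrative, not any real pipeline's rules."""
    seen = set()
    for text in documents:
        normalized = " ".join(text.split()).lower()       # collapse whitespace for hashing
        digest = hashlib.sha256(normalized.encode()).hexdigest()
        if digest in seen:                                # exact duplicate: drop
            continue
        seen.add(digest)
        if not (min_words <= len(normalized.split()) <= max_words):
            continue                                      # too short or suspiciously long
        yield text

docs = ["Hello   world. " * 60, "hello world. " * 60, "short"]
print(len(list(curate(docs))))  # -> 1: one duplicate and one too-short doc removed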
Model collapse
A degradation phenomenon where AI models trained on data generated by other AI models progressively lose quality, diversity, and capability. Model collapse can occur when synthetic data displaces human-generated data in training pipelines.
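The mechanism is easy to see in a toy setting. The sketch below is a simplified version of the Gaussian example often used to illustrate collapse, not any published experiment: each "generation" fits a distribution to samples drawn from the previous fit, and the sample sizes and seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: "human" data drawn from a standard normal distribution.
data = rng.normal(loc=0.0, scale=1.0, size=20)

for gen in range(1, 101):
    mu, sigma = data.mean(), data.std()     # "train" a model: fit a Gaussian
    data = rng.normal(mu, sigma, size=20)   # next generation sees only model output
    if gen % 20 == 0:
        print(f"generation {gen:3d}: fitted std = {sigma:.3f}")

# With small samples, estimation error compounds across generations and the
# fitted spread tends to shrink toward zero: each model sees only the previous
# model's outputs, so the tails of the original distribution are lost.
```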
Synthetic data
Artificially generated training data created by AI models or simulations. Synthetic data is increasingly used when real data is scarce, private, or expensive, but quality and diversity remain open challenges.
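One classic quality failure is easy to demonstrate: naively sampling each feature from its fitted marginal produces plausible-looking synthetic rows but destroys the joint structure of the real data. The age and income numbers below are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# A tiny "real" table: ages with strongly correlated incomes (toy numbers).
ages = rng.integers(22, 65, size=500)
incomes = 1200 * ages + rng.normal(0, 5000, size=500)

# Naive synthetic generator: sample each column independently from its
# fitted marginal. Cheap, and safer to share than raw records...
synth_ages = rng.normal(ages.mean(), ages.std(), size=500)
synth_incomes = rng.normal(incomes.mean(), incomes.std(), size=500)

# ...but the joint structure is gone: the age/income correlation collapses,
# illustrating why quality and diversity remain open challenges.
print(f"real corr:      {np.corrcoef(ages, incomes)[0, 1]:.2f}")              # ~0.95
print(f"synthetic corr: {np.corrcoef(synth_ages, synth_incomes)[0, 1]:.2f}")  # ~0.00
```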