Data & Training Deep Dive

Pre-training data

Definition
The massive datasets used to train foundation models during the pre-training phase, typically composed of web crawls, books, academic papers, code repositories, and other text sources. Pre-training data quality and composition directly determine model capabilities.
Why it matters
Pre-training data is becoming the most valuable and contested resource in AI. As model architectures converge, data quality is the primary differentiator between models. The legal landscape is in flux: lawsuits from publishers, artists, and software developers challenge whether training on copyrighted content constitutes fair use. Licensing deals worth billions of dollars are being signed for access to high-quality training data. For enterprises, pre-training data provenance matters for compliance: if your model was trained on data that is later ruled to violate copyright, you may face liability. Understanding what went into a model's training data is essential for evaluating its suitability for your use case.
In practice
Common Crawl, a nonprofit web archive containing petabytes of web pages, is the backbone of most LLM training datasets. The Pile, curated by EleutherAI, combined 22 diverse datasets and was used to train many open-source models; RedPajama and FineWeb are more recent openly curated alternatives. On the legal front, the New York Times has sued OpenAI and Microsoft over training on its articles, and Getty Images has sued Stability AI over the use of its photos. On the licensing front, OpenAI has struck deals with Axel Springer, the Associated Press, and others, and Google reportedly pays around $60M per year for access to Reddit data. The pre-2023 web is increasingly prized because it predates widespread AI-generated content, making it less likely to cause model collapse.
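For a hands-on sense of what these corpora contain, they can usually be inspected in streaming mode so nothing has to be downloaded in full. The sketch below pulls a few documents from a FineWeb sample via the Hugging Face datasets library and prints their source URLs and crawl dates; the dataset identifier "HuggingFaceFW/fineweb", the "sample-10BT" subset name, and the field names ("text", "url", "date") are assumptions about how the corpus is published on the Hub, not guarantees.

# Minimal sketch: peek at a pre-training corpus without downloading it.
# Assumes the `datasets` library is installed and that the FineWeb sample
# subset ("HuggingFaceFW/fineweb", config "sample-10BT") exists on the Hub.
from datasets import load_dataset

# Streaming avoids materializing a multi-terabyte corpus on local disk.
fineweb = load_dataset(
    "HuggingFaceFW/fineweb",   # assumed dataset identifier
    name="sample-10BT",        # assumed ~10B-token sample config
    split="train",
    streaming=True,
)

# Print a handful of documents: the text snippet plus source URL and crawl
# date, which are the provenance details compliance reviews tend to ask about.
for i, doc in enumerate(fineweb):
    print(doc.get("url"), doc.get("date"))          # assumed metadata fields
    print(doc.get("text", "")[:200].replace("\n", " "), "...")
    if i == 4:
        break

Inspecting provenance metadata like source URLs and crawl dates is the practical starting point for the compliance questions raised above: where a model's training text came from, and whether that source is contested or licensed.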
