Data curation
- Definition
- The process of selecting, cleaning, filtering, deduplicating, and organizing training data to maximize model quality. Data curation is increasingly recognized as at least as important as dataset size for model performance.
- Why it matters
- The era of 'just scrape more internet' is over. Research repeatedly finds that data quality can matter more than data quantity: a model trained on a smaller, carefully curated corpus can outperform one trained on several times as many noisy tokens. This shift has made data curation one of the highest-leverage activities in AI development. Teams that excel at curation (filtering out low-quality content, removing duplicates, balancing domains, and ensuring diversity) produce better models at lower cost. For companies building proprietary models, the curation pipeline is a major competitive advantage. For everyone else, understanding curation helps in evaluating which foundation models are likely to perform best in a given domain.
- In practice
- Meta's Llama 3 was pretrained on more than 15T tokens of publicly sourced data, selected through a series of filtering pipelines: heuristic filters, NSFW filters, semantic deduplication, and text classifiers that predict data quality. Meta reported using Llama 2 to generate the training data for those quality classifiers, and ran extensive experiments on how to mix data from different sources. Microsoft's Phi-3 achieved strong performance at small model sizes largely through intensive data curation, combining heavily filtered web data with synthetic training examples rather than relying on raw web data. The open-source community has released openly documented datasets and curation pipelines such as RedPajama and FineWeb, making broadly available what was previously a secret sauce of frontier labs.
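The three steps named above (quality filtering, deduplication, domain balancing) can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not any lab's actual pipeline: `quality_score` is a hypothetical heuristic standing in for a learned quality classifier, deduplication is exact-match on normalized text rather than semantic or fuzzy, and the `min_score` and `max_per_domain` thresholds are made up for the example.

```python
import hashlib
import re
from collections import defaultdict


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def quality_score(text: str) -> float:
    """Hypothetical heuristic standing in for a learned quality classifier."""
    words = text.split()
    if len(words) < 5:  # very short snippets carry little training signal
        return 0.0
    # Share of distinct words: penalizes repetitive spam.
    unique_ratio = len({w.lower() for w in words}) / len(words)
    # Share of letters and spaces: penalizes markup and symbol noise.
    alpha_ratio = sum(ch.isalpha() or ch.isspace() for ch in text) / len(text)
    return unique_ratio * alpha_ratio


def curate(docs, min_score=0.5, max_per_domain=2):
    """Apply quality filtering, exact deduplication, then a per-domain cap."""
    seen = set()
    per_domain = defaultdict(int)
    kept = []
    for doc in docs:
        text = doc["text"]
        if quality_score(text) < min_score:
            continue  # step 1: drop low-quality documents
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # step 2: drop exact duplicates (after normalization)
        seen.add(digest)
        if per_domain[doc["domain"]] >= max_per_domain:
            continue  # step 3: cap each domain so none dominates the mix
        per_domain[doc["domain"]] += 1
        kept.append(doc)
    return kept


# Made-up sample documents, for illustration only.
docs = [
    {"domain": "web", "text": "The quick brown fox jumps over the lazy dog."},
    {"domain": "web", "text": "the quick  brown fox jumps over the lazy dog."},
    {"domain": "web", "text": "buy buy buy buy buy buy buy buy"},
    {"domain": "web", "text": "Click here"},
    {"domain": "code", "text": "Binary search halves the sorted range on every comparison."},
    {"domain": "web", "text": "Gradient descent updates weights against the loss gradient."},
    {"domain": "web", "text": "Tokenizers split raw text into units the model can embed."},
]

kept = curate(docs)
print([d["domain"] for d in kept])
```

On this sample, the second document is removed as a duplicate, the third and fourth fail the quality heuristic, and the last is dropped by the domain cap. Production pipelines typically replace each step with something stronger, such as trained classifiers and near-duplicate detection, but the overall shape is similar.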
Related terms
Pre-training data
The massive datasets used to train foundation models during the pre-training phase, typically composed of web crawls, books, academic papers, code repositories, and other text sources. Pre-training data quality and composition directly determine model capabilities.
Pre-training
The initial phase of model training where the network learns general knowledge from a massive dataset. Pre-training is the most expensive phase, often costing tens or hundreds of millions of dollars for frontier models.
Synthetic data
Artificially generated training data created by AI models or simulations. Synthetic data is increasingly used when real data is scarce, private, or expensive, but quality and diversity remain open challenges.
Data flywheel
A self-reinforcing loop where user interactions generate data that improves the model, which attracts more users, generating more data. Data flywheels are among the strongest moats in AI.