Data & TrainingDeep Dive

Data curation

Definition
The process of selecting, cleaning, filtering, deduplicating, and organizing training data to maximize model quality. Data curation is increasingly recognized as more important than dataset size for model performance.
Why it matters
The era of 'just scrape more internet' is over. Research consistently shows that data quality trumps data quantity: a model trained on 1T carefully curated tokens outperforms one trained on 10T noisy tokens. This shift has made data curation one of the highest-leverage activities in AI development. The teams that excel at curation, filtering out low-quality content, removing duplicates, balancing domains, and ensuring diversity, produce better models at lower cost. For companies building proprietary models, your curation pipeline is your competitive advantage. For everyone else, understanding curation helps you evaluate which foundation models are likely to perform best in your domain.
In practice
Meta's Llama 3 used aggressive data curation, filtering a 15T token web crawl down to 1.8T high-quality tokens using classifier-based quality filtering, deduplication, and domain rebalancing. The result was a model that outperformed competitors trained on larger but noisier datasets. Phi-3 from Microsoft achieved remarkable performance at small model sizes specifically through intensive data curation, synthesizing high-quality training examples rather than using raw web data. The open-source community developed tools like RedPajama and FineWeb for curating training datasets, democratizing what was previously a secret sauce of frontier labs.

We cover data & training every week.

Get the 5 AI stories that matter — free, every Friday.

Know the terms. Know the moves.

Get the 5 AI stories that matter every Friday — free.

Free forever. No spam.