Pre-training
- Definition
- The initial phase of model training where the network learns general knowledge from a massive dataset. Pre-training is the most expensive phase, often costing tens or hundreds of millions of dollars for frontier models.
- Why it matters
- Pre-training is where a model's fundamental capabilities are established. Everything that follows (fine-tuning, alignment, deployment) builds on top of what was learned during pre-training. The cost and scale of pre-training create a natural oligopoly: only organizations with budgets of $100M or more and access to thousands of GPUs can train frontier models from scratch, making pre-training a barrier to entry that shapes the entire competitive landscape. Understanding pre-training also explains why fine-tuning works: the model has already learned language, reasoning, and world knowledge during pre-training, so fine-tuning only needs to steer those existing capabilities toward specific tasks.
- In practice
- GPT-4's pre-training reportedly cost over $100M and ran for several months on tens of thousands of GPUs. Llama 3 405B was trained on 15.6 trillion tokens using 16,000 H100 GPUs. Anthropic's Claude models use similarly massive pre-training runs. The pre-training process involves a simple objective (predicting the next token) applied at enormous scale. The dataset composition (web text, books, code, academic papers) heavily influences the model's strengths and weaknesses: weighting the mix toward code tends to improve reasoning, while more scientific text improves technical knowledge. The full pre-training recipe, including the data mix, learning rate schedule, and architectural choices, is a closely guarded secret at every lab.
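The next-token objective described above is just cross-entropy loss on the true next token. A toy sketch in plain Python (an illustration of the objective, not any lab's actual training code; the vocabulary and scores are made up):

```python
import math

def next_token_loss(logits, target_id):
    """Cross-entropy loss for one next-token prediction.

    logits: raw model scores over the vocabulary at one position.
    target_id: index of the token that actually came next in the corpus.
    """
    # Softmax normalization (subtract the max for numerical stability).
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    prob_target = exps[target_id] / sum(exps)
    # Training minimizes -log p(next token | context),
    # averaged over trillions of positions.
    return -math.log(prob_target)

# Toy vocabulary of 4 tokens; the model scores the true token highest,
# so the loss is small.
loss = next_token_loss([2.0, 0.1, -1.0, 0.5], target_id=0)
```

At pre-training scale this same per-token loss is averaged over every position in the dataset, which is why data composition matters so much: the model is rewarded only for predicting whatever text it is shown.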
Related terms
Training
The process of teaching a neural network by feeding it data and adjusting its parameters to minimize prediction errors. Training frontier models now costs $100M+ and takes months on thousands of GPUs.
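The parameter adjustment described here is, at its core, gradient descent: nudge each parameter a small step against its gradient to reduce the loss. A minimal sketch (toy values, not a real training loop):

```python
def sgd_step(params, grads, lr=0.01):
    # One stochastic-gradient-descent update: each parameter moves
    # a small step (scaled by the learning rate) opposite its gradient.
    return [p - lr * g for p, g in zip(params, grads)]

# Two toy parameters with their loss gradients; after the step,
# both have moved in the direction that lowers the loss.
params = sgd_step([0.5, -0.2], [1.0, -2.0], lr=0.1)
```

Frontier training runs repeat this update (in practice with optimizers like Adam) billions of times across thousands of GPUs, which is where the months of wall-clock time go.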
Foundation model
A large, general-purpose model pre-trained on broad data that can be adapted to many downstream tasks. GPT-4, Claude, Gemini, and Llama are all foundation models. The term signals massive upfront investment and wide applicability.
Pre-training data
The massive datasets used to train foundation models during the pre-training phase, typically composed of web crawls, books, academic papers, code repositories, and other text sources. Pre-training data quality and composition directly determine model capabilities.
Scaling laws
Empirical relationships showing that model performance improves predictably as you increase data, compute, and parameters. Scaling laws are why labs are pouring billions into ever-larger training runs.
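These empirical relationships are often written in the Chinchilla form L(N, D) = E + A/N^α + B/D^β, where N is parameter count and D is training tokens. A sketch using the coefficients reported in the Chinchilla paper (Hoffmann et al., 2022), here purely as an illustration of the functional form:

```python
def chinchilla_loss(N, D, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Predicted pre-training loss for N parameters and D training tokens.

    Coefficients are the published Chinchilla fits; real labs refit
    these constants for their own data and architecture.
    """
    return E + A / N**alpha + B / D**beta

# Scaling both model size and data 10x lowers the predicted loss,
# approaching the irreducible term E.
small = chinchilla_loss(N=7e9, D=1.4e11)
large = chinchilla_loss(N=70e9, D=1.4e12)
```

The smooth, predictable decrease in loss as N and D grow is exactly why labs keep funding ever-larger runs: the return on the next order of magnitude can be estimated before spending it.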