Data & Training · Core

Pre-training

Definition
The initial phase of model training, in which the network learns general language, knowledge, and reasoning patterns from a massive unlabeled dataset, typically by predicting the next token. Pre-training is the most expensive phase, often costing tens or hundreds of millions of dollars for frontier models.
Why it matters
Pre-training is where a model's fundamental capabilities are established. Everything that follows (fine-tuning, alignment, deployment) builds on top of what was learned during pre-training. The cost and scale of pre-training create a natural oligopoly: only organizations with $100M+ budgets and access to thousands of GPUs can train frontier models from scratch. This makes pre-training a barrier to entry that shapes the entire competitive landscape. Understanding pre-training also explains why fine-tuning works: the model has already learned language, reasoning, and world knowledge during pre-training, so fine-tuning only needs to steer these existing capabilities toward specific tasks.
In practice
GPT-4's pre-training reportedly cost over $100M and used tens of thousands of GPUs for several months. Llama 3 405B was trained on 15.6 trillion tokens using 16,000 H100 GPUs. Anthropic's Claude models use similarly massive pre-training runs. The pre-training process involves a simple objective (predicting the next token) applied at enormous scale. The dataset composition (web text, books, code, academic papers) heavily influences the model's strengths and weaknesses: training on more code tends to improve reasoning, while more scientific text tends to improve technical knowledge. The pre-training recipe, including data mix, learning rate schedule, and architectural choices, is a closely guarded secret at every lab.
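The next-token objective above can be illustrated at toy scale. The sketch below (all names are illustrative, not from any lab's codebase) fits a bigram count model to a tiny corpus and computes the same cross-entropy loss that pre-training minimizes: the average negative log-probability assigned to each actual next token. Real pre-training replaces the count table with a transformer and the toy corpus with trillions of tokens, but the loss computation is the same.

```python
# Toy sketch of the pre-training objective: next-token prediction.
# A bigram count model stands in for the neural network; the loss
# (average negative log-probability of the true next token) is the
# same quantity a real pre-training run minimizes.
import math
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ran".split()

# Count how often each token follows each preceding token.
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def next_token_probs(prev):
    """Model's predicted distribution over the next token given `prev`."""
    counts = bigram_counts[prev]
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

def cross_entropy(tokens):
    """Average negative log-probability of each actual next token."""
    losses = []
    for prev, nxt in zip(tokens, tokens[1:]):
        p = next_token_probs(prev).get(nxt, 1e-9)  # floor for unseen pairs
        losses.append(-math.log(p))
    return sum(losses) / len(losses)

loss = cross_entropy(corpus)
```

Here "the" is followed by "cat" twice and "mat" once, so the model assigns P(cat | the) = 2/3; a lower cross-entropy means the model predicts its training text better, which is exactly what gradient descent optimizes at scale.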
