SFT (Supervised Fine-Tuning)
- Definition
- The process of training a pre-trained model on a curated dataset of input-output examples that demonstrate the desired behavior. SFT is typically the first alignment step after pre-training, teaching the model to follow instructions and produce useful responses.
- Why it matters
- SFT is the workhorse of model customization. While RLHF and DPO get more attention, SFT is where most of the behavioral shaping happens. The quality and diversity of your SFT dataset directly determines your model's instruction-following ability, domain knowledge, and output style. For enterprises building custom models, SFT is usually the highest-ROI investment: a few thousand high-quality examples can transform a generic base model into a domain expert. The key insight is that SFT example quality matters far more than quantity: 1,000 expert-written examples typically outperform 100,000 crowd-sourced ones.
- In practice
- The standard post-training pipeline is: SFT first (teaching instruction following), then RLHF or DPO (refining preferences and safety). Meta's Llama 3 used approximately 10M SFT examples across multiple rounds. OpenAI's fine-tuning API is essentially a managed SFT service. The open-source community uses datasets like OpenHermes, Alpaca, and ShareGPT for SFT. Key practices include: diverse task coverage (instruction following, conversation, analysis, coding), high-quality examples (expert-written, not auto-generated), and careful decontamination (ensuring evaluation data does not appear in training data). Companies like Scale AI provide custom SFT dataset creation services.
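The core mechanic of SFT described above — training on input-output pairs so the model learns to produce the desired response — can be sketched at the data level. A minimal illustration, assuming a toy setup with made-up token IDs (no real tokenizer or model): the prompt and response are concatenated, and the prompt positions are typically masked out of the loss (PyTorch's cross-entropy convention uses -100 as the ignore label) so the model is only penalized on the response it should learn to produce.

```python
# Illustrative sketch of SFT label construction (toy token IDs, no real model).
# Convention: positions labeled IGNORE_INDEX are excluded from the loss,
# so gradient signal comes only from the response tokens.

IGNORE_INDEX = -100  # PyTorch CrossEntropyLoss default ignore_index

def build_sft_labels(prompt_tokens, response_tokens):
    """Concatenate prompt + response; mask prompt positions out of the loss."""
    input_ids = list(prompt_tokens) + list(response_tokens)
    labels = [IGNORE_INDEX] * len(prompt_tokens) + list(response_tokens)
    return input_ids, labels

# Pretend token IDs for a prompt ("Translate: hello") and response ("bonjour")
prompt = [101, 5, 17, 42]
response = [99, 7]
ids, labels = build_sft_labels(prompt, response)
```

The design choice here is the loss mask: without it, the model would also be trained to reproduce prompts, which wastes capacity and can degrade instruction following.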
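The decontamination practice mentioned above is often implemented as an n-gram overlap check between training and evaluation data. A minimal sketch, assuming a word-level 8-gram match as the contamination criterion (the window size and matching rules vary between labs):

```python
# Minimal n-gram decontamination check (illustrative; real pipelines
# normalize text more aggressively and tune n per benchmark).

def ngrams(text, n=8):
    """Return the set of word-level n-grams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(train_example, eval_texts, n=8):
    """Flag a training example if it shares any n-gram with the eval set."""
    eval_grams = set()
    for t in eval_texts:
        eval_grams |= ngrams(t, n)
    return bool(ngrams(train_example, n) & eval_grams)
```

Flagged examples are dropped from the SFT set before training, so benchmark scores measure generalization rather than memorization.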
Related terms
Fine-tuning
The process of continuing to train a pre-trained model on a smaller, task-specific dataset. Fine-tuning customizes model behavior for specific domains or formats and is a key part of most enterprise AI deployments.
Reinforcement Learning from Human Feedback (RLHF)
A training technique where human raters rank model outputs, and the model learns to prefer higher-ranked responses. RLHF is what makes AI assistants helpful, harmless, and conversational rather than just autocomplete.
DPO (Direct Preference Optimization)
A training technique that aligns language models with human preferences by directly optimizing on preference data, without needing a separate reward model. DPO simplifies the RLHF pipeline while achieving comparable alignment quality.
Pre-training
The initial phase of model training where the network learns general knowledge from a massive dataset. Pre-training is the most expensive phase, often costing tens or hundreds of millions of dollars for frontier models.