Synthetic Data for LLM Training: What's Working and What Isn't
The idea of using AI-generated data to train AI models provokes reasonable skepticism. It sounds like pulling yourself up by your bootstraps — circular, self-referential, bound to collapse into a mess of amplified errors. And sometimes it does exactly that. But careful synthetic data approaches are producing genuine improvements in model capability, particularly for specialised domains.
Understanding where synthetic data works and where it fails matters for anyone building or fine-tuning language models.
Why Synthetic Data Exists
The straightforward answer: real data is expensive, slow to collect, and often insufficient for specialised tasks.
Fine-tuning an LLM for medical question answering requires thousands of high-quality medical Q&A pairs with verified answers. Collecting these from practising physicians is slow and expensive, regulatory requirements around patient data add complexity, and the volume needed for effective fine-tuning is difficult to reach through manual collection alone.
Synthetic data generation uses a capable model (typically a frontier LLM like GPT-4 or Claude) to generate training examples that can supplement or replace manually collected data. The generated examples are filtered, validated, and used to train smaller or more specialised models.
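A minimal sketch of that generation loop, assuming a hypothetical `call_llm` helper standing in for whatever provider API you use:

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around your frontier-model API (OpenAI, Anthropic, etc.)."""
    raise NotImplementedError("plug in your provider's client here")

def generate_examples(topics: list[str], per_topic: int = 5) -> list[dict]:
    """Generate candidate Q&A pairs; these still need filtering and validation."""
    examples = []
    for topic in topics:
        for _ in range(per_topic):
            raw = call_llm(
                f"Write one question about {topic} and a correct answer. "
                'Respond as JSON: {"question": "...", "answer": "..."}'
            )
            try:
                examples.append(json.loads(raw))
            except json.JSONDecodeError:
                continue  # malformed generations are dropped, not repaired
    return examples
```

Everything downstream (filtering, validation, training) operates on the candidates this loop produces.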
This isn’t new in ML — synthetic data has been used in computer vision for years. Generating synthetic training images through augmentation, rendering, or generative models is standard practice. Applying similar approaches to text is a natural extension.
What’s Actually Working
Instruction following. The original Alpaca work demonstrated that a relatively small dataset of 52,000 synthetic instructions generated by OpenAI’s text-davinci-003 could fine-tune a base LLaMA model to follow instructions reasonably well. This basic approach — use a strong model to generate instruction-response pairs, train a weaker model on them — has since been refined and remains effective.
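A simplified sketch of that recipe: sample a few human-written seed tasks as in-context examples, then ask the strong model for novel ones. The real Alpaca pipeline started from 175 human-written seed tasks and a much longer prompt template; the `SEED_INSTRUCTIONS` here are illustrative.

```python
import random

# A handful of human-written seed tasks; Alpaca used 175 of these.
SEED_INSTRUCTIONS = [
    "Explain photosynthesis to a ten-year-old.",
    "Rewrite this sentence in the passive voice: ...",
    "List three pros and cons of remote work.",
]

def build_generation_prompt(k: int = 3) -> str:
    """Sample seed tasks as in-context demonstrations and ask for novel ones."""
    demos = "\n".join(f"- {s}" for s in random.sample(SEED_INSTRUCTIONS, k))
    return (
        "Here are example instructions:\n"
        f"{demos}\n"
        "Write 5 new, diverse instructions with a high-quality response for each."
    )
```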
Domain adaptation. Generating domain-specific Q&A pairs, conversation examples, and reasoning chains for specialised fields (legal, medical, financial) produces models that perform substantially better in those domains than general-purpose models. The synthetic data doesn’t need to be perfect — it needs to be representative of the domain’s language, concepts, and reasoning patterns.
Reasoning chains. Generating step-by-step reasoning examples (chain-of-thought) using strong models, then training smaller models on these examples, transfers some reasoning capability. The smaller model learns the pattern of breaking problems into steps, even if it can’t reason as deeply as the model that generated the training data.
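Collecting such examples usually means forcing a parseable structure onto the generations. A sketch, assuming an "Answer:" marker convention for splitting rationale from final answer:

```python
COT_TEMPLATE = (
    "Solve the following problem. Think step by step, "
    "then give the final answer after the marker 'Answer:'.\n\nProblem: {question}"
)

def parse_cot(raw: str) -> tuple[str, str] | None:
    """Split a generation into (reasoning, answer); reject it if the marker is missing."""
    if "Answer:" not in raw:
        return None
    reasoning, answer = raw.rsplit("Answer:", 1)
    return reasoning.strip(), answer.strip()
```

Generations that fail to parse are simply discarded, which doubles as a cheap quality filter.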
Data augmentation for rare scenarios. Real datasets often underrepresent edge cases. Synthetic generation can create examples of unusual scenarios — rare error conditions, atypical user queries, adversarial inputs — that improve model robustness in situations rarely seen in organic data.
Team400 recently published a case study in which they used synthetic data to fine-tune a classification model for an Australian financial services client. The domain-specific terminology and regulatory context made generic models unreliable, and collecting enough real labelled examples would have taken months.
What Doesn’t Work Well
Model collapse. Training models recursively on their own outputs (or outputs from similar models) degrades quality over generations. Each generation amplifies errors and reduces diversity: the distribution of generated text narrows, losing the variation present in real human-written text. This is a well-documented limitation on synthetic data approaches.
Factual accuracy at scale. Generating thousands of factual Q&A pairs inevitably includes hallucinated facts. Quality filtering catches some of these, but at scale, incorrect information leaks into training data. Models trained on these examples may confidently reproduce fabricated facts. For domains where factual accuracy is critical, synthetic data requires extensive verification that partly offsets the cost savings.
Capturing genuine human diversity. AI-generated text has characteristic patterns — certain sentence structures, vocabulary preferences, reasoning styles. Training exclusively on synthetic data produces models that sound artificial. They’re competent but lack the messy, variable quality of human-generated content. Real data provides diversity that synthetic data struggles to replicate.
Complex multi-step reasoning. Synthetic reasoning chains are capped by the generating model’s own ability: you can’t generate examples of reasoning that exceed the generator’s capability. This puts a fundamental ceiling on how much reasoning transfer is possible through synthetic data.
Quality Filtering Is Everything
Raw synthetic data is not directly usable for training. The quality distribution includes excellent examples, mediocre ones, and outright errors. The filtering pipeline that separates good from bad determines whether synthetic data helps or hurts.
Automated filtering uses metrics like perplexity, consistency checks, format validation, and classifier-based quality scoring. This catches obviously bad examples but misses subtler quality issues.
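The cheap, fully automated tier of such a pipeline might look like the sketch below, assuming examples arrive as question/answer dicts (perplexity and classifier-based scoring would run as later, more expensive stages):

```python
def passes_basic_filters(example: dict, min_len: int = 20, max_len: int = 4000) -> bool:
    """Cheap first-pass checks: schema, length bounds, and a crude degeneration test."""
    if not {"question", "answer"} <= example.keys():
        return False  # format validation: required fields must be present
    answer = example["answer"]
    if not (min_len <= len(answer) <= max_len):
        return False  # too short to be useful, or suspiciously long
    words = answer.lower().split()
    if words and len(set(words)) / len(words) < 0.3:
        return False  # heavy repetition is a common generation failure mode
    return True
```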
Human review of a sample provides calibration for automated filters. Reviewing 500-1000 examples from a synthetic batch helps calibrate quality thresholds for automated filtering of the remaining thousands.
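One way to turn that review into a concrete cutoff, sketched here assuming you have (quality_score, human_accepted) pairs for the reviewed sample:

```python
def calibrate_threshold(scored_sample: list[tuple[float, bool]],
                        target_precision: float = 0.95) -> float:
    """Pick the lowest score threshold that still hits the target precision,
    based on human accept/reject labels for a reviewed sample."""
    scored_sample.sort(key=lambda pair: pair[0], reverse=True)
    best = scored_sample[0][0] if scored_sample else 0.0
    accepted = 0
    for i, (score, human_ok) in enumerate(scored_sample, start=1):
        accepted += human_ok
        if accepted / i >= target_precision:
            best = score  # lowering the bar to here still meets the target
    return best
```

The returned threshold is then applied to the unreviewed remainder of the batch.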
Deduplication matters more than with real data because generative models tend to produce similar outputs across different prompts. Near-duplicate examples waste training budget and can bias the model toward repeated patterns.
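A minimal near-duplicate filter using trigram Jaccard similarity; this is quadratic and illustrative only, since real pipelines typically switch to MinHash/LSH at scale:

```python
def ngrams(text: str, n: int = 3) -> set[str]:
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def dedupe(examples: list[str], threshold: float = 0.7) -> list[str]:
    """Greedily drop examples whose trigram overlap with a kept example is too high."""
    kept_grams: list[set[str]] = []
    out = []
    for text in examples:
        grams = ngrams(text)
        if any(grams and k and len(grams & k) / len(grams | k) >= threshold
               for k in kept_grams):
            continue  # near-duplicate of something already kept
        kept_grams.append(grams)
        out.append(text)
    return out
```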
Adversarial filtering — deliberately trying to find bad examples through targeted searches for common LLM failure modes — catches problems that random sampling misses.
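A small illustration of the idea, using made-up regex signatures for a few well-known failure modes; a real list would grow as you discover new ones:

```python
import re

# Signatures of common LLM failure modes (illustrative, not exhaustive).
FAILURE_PATTERNS = [
    re.compile(r"as an ai (language )?model", re.I),  # boilerplate disclaimers
    re.compile(r"\b(\w+)( \1\b){3,}", re.I),          # word stuttering / loops
    re.compile(r"i (cannot|can't) (help|assist)", re.I),  # refusals in training data
]

def looks_adversarially_bad(text: str) -> bool:
    return any(p.search(text) for p in FAILURE_PATTERNS)
```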
The Hugging Face community has developed several open-source filtering tools specifically designed for synthetic dataset curation. These are worth exploring before building custom filtering pipelines from scratch.
Practical Guidelines
Mix synthetic with real data. The best results come from combining synthetic data with real data rather than using synthetic data exclusively. Ratios vary by task, but 70-80% synthetic and 20-30% real is a common starting point for fine-tuning.
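A sketch of building such a mix, with the split exposed as a parameter (the 75% default matches the starting point above):

```python
import random

def mix_datasets(synthetic: list, real: list,
                 synth_frac: float = 0.75, seed: int = 0) -> list:
    """Combine all real examples with enough synthetic ones to hit synth_frac."""
    rng = random.Random(seed)
    n_synth = int(len(real) * synth_frac / (1 - synth_frac))
    mixed = rng.sample(synthetic, min(n_synth, len(synthetic))) + list(real)
    rng.shuffle(mixed)  # avoid ordering artifacts during training
    return mixed
```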
Use the strongest available model for generation. The quality ceiling of synthetic data is set by the generating model. Using a frontier model to generate training data for a smaller model produces better results than generating with a model of quality similar to the one being trained.
Diversify generation prompts. The same prompt structure generates similar outputs. Varying prompts, contexts, and generation parameters increases diversity in the synthetic dataset. This is analogous to data augmentation in computer vision.
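One simple way to get that variety is to cross topics, personas, and output formats; the lists below are placeholders to be replaced with your domain’s own:

```python
import itertools
import random

TOPICS = ["margin lending", "chain of custody", "drug interactions"]
PERSONAS = ["a confused customer", "a domain expert", "a new hire"]
FORMATS = ["a short Q&A", "a multi-turn dialogue", "a worked example"]

def diverse_prompts() -> list[str]:
    """Cross the axes to produce varied prompts; vary temperature per call too."""
    prompts = [
        f"Write {fmt} about {topic}, phrased as {persona} would ask it."
        for topic, persona, fmt in itertools.product(TOPICS, PERSONAS, FORMATS)
    ]
    random.shuffle(prompts)
    return prompts
```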
Validate on held-out real data. Always evaluate models trained on synthetic data against real, human-generated test sets. Evaluation on synthetic test sets overestimates performance because the test set shares distributional characteristics with training data.
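A minimal exact-match evaluation loop over a real test set, with `model_fn` standing in for your fine-tuned model’s inference call:

```python
def exact_match_accuracy(model_fn, test_set: list[dict]) -> float:
    """Score on real, human-written (question, answer) pairs only."""
    correct = sum(
        model_fn(ex["question"]).strip().lower() == ex["answer"].strip().lower()
        for ex in test_set
    )
    return correct / len(test_set)
```

Exact match is the crudest possible metric; the point is that `test_set` must be human-generated, whatever metric you use.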
Document the generation process. Record which models generated the data, what prompts were used, filtering criteria applied, and any known limitations. This provenance information is essential for reproducibility and for understanding model behavior.
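A simple way to keep that record alongside the dataset, sketched as a dataclass serialised to JSON; the field names are suggestions, not a standard:

```python
import datetime
import json
from dataclasses import asdict, dataclass, field

@dataclass
class DatasetProvenance:
    generator_model: str            # e.g. the exact model version used
    prompt_templates: list[str]
    filtering_criteria: list[str]
    known_limitations: list[str]
    created: str = field(default_factory=lambda: datetime.date.today().isoformat())

record = DatasetProvenance(
    generator_model="gpt-4-0613",
    prompt_templates=["qa_v3", "dialogue_v1"],
    filtering_criteria=["length 20-4000 chars", "trigram dedupe @ 0.7"],
    known_limitations=["answers not expert-verified"],
)
with open("provenance.json", "w") as f:
    json.dump(asdict(record), f, indent=2)
```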
The Cost Equation
Synthetic data isn’t free. API costs for generating thousands of examples using frontier models add up. Filtering, validation, and curation require engineering time. The total cost is typically lower than manual data collection but higher than many teams initially estimate.
For a typical fine-tuning project generating 10,000 training examples with a frontier model API, expect generation costs in the hundreds to low thousands of dollars, plus engineering time for pipeline development, filtering, and validation.
Compare this to manual collection, where subject-matter experts might charge $50-200 per hour and produce 10-30 examples per hour depending on complexity. The cost advantage of synthetic data becomes clear at scale.
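A back-of-envelope comparison using the ranges above; every number here is an assumption to be replaced with your own rates:

```python
# Illustrative cost comparison for a 10,000-example fine-tuning set.
kept = 10_000
overgen = 3                 # assume you generate ~3x and keep the best third
tokens_per_example = 1_000  # prompt + completion, a rough guess
price_per_1k_tokens = 0.06  # placeholder rate; check current API pricing

synthetic = kept * overgen * tokens_per_example / 1_000 * price_per_1k_tokens
manual = kept / 20 * 125    # mid-range: 20 examples/hr at $125/hr

print(f"synthetic ≈ ${synthetic:,.0f}")  # ≈ $1,800
print(f"manual    ≈ ${manual:,.0f}")     # ≈ $62,500
```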
Looking Forward
Synthetic data approaches will continue improving as generating models get better. Higher-quality generators produce higher-quality synthetic data, which trains better specialised models. This is a genuine positive feedback loop, distinct from the negative loop of model collapse.
The research frontier is moving toward more sophisticated generation strategies — curriculum-based generation, self-play, iterative refinement — that produce higher quality synthetic data than simple prompt-response generation.
For practitioners, the current state is clear enough: synthetic data is a legitimate and useful tool for ML training when applied carefully. It’s not a replacement for real data, but it’s a powerful supplement that reduces data collection costs and timelines for specialised applications.