Definition
Synthetic data is training data generated by another model (or a procedural system) rather than collected from the world. A lot of frontier reasoning training runs on it, and that raises sharp questions about what gets baked in along with the answers: the generator’s mistakes, biases, and stylistic habits can carry over into whatever is trained on its output.
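To make the "procedural system" case concrete, here is a minimal sketch of a generator that produces arithmetic question/answer pairs with no model involved. The names `make_example` and `make_dataset`, the prompt wording, and the `prompt`/`completion` field names are all hypothetical choices for illustration, not a format any particular training pipeline requires.

```python
import random

def make_example(rng):
    """Build one synthetic arithmetic example (hypothetical format)."""
    a = rng.randint(10, 99)
    b = rng.randint(10, 99)
    op = rng.choice(["+", "-", "*"])
    # The answer is computed, not collected, so it is correct by construction.
    answer = {"+": a + b, "-": a - b, "*": a * b}[op]
    return {"prompt": f"What is {a} {op} {b}?", "completion": str(answer)}

def make_dataset(n, seed=0):
    """Generate n examples; a fixed seed makes the dataset reproducible."""
    rng = random.Random(seed)
    return [make_example(rng) for _ in range(n)]

if __name__ == "__main__":
    for row in make_dataset(5):
        print(row)
```

Even in a toy like this, the generator's choices (two-digit operands, three operators, one phrasing) define everything the downstream model ever sees, which is a small version of the "what gets baked in" question.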
Worth reading next
Papers we haven't done a deep dive on yet, but would recommend on this topic.
- Constitutional AI: Harmlessness from AI Feedback
- Alignment faking in large language models
- ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs
- GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices
- Model Collapse Demystified: The Case of Regression