Definition
Knowledge distillation trains a smaller “student” model to mimic the outputs of a larger “teacher” model, producing a model that is much cheaper to run while retaining a large fraction of the teacher’s capability. It’s a standard way for labs to convert a flagship model into a deployable lineup.
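As a rough illustration of what "mimic the outputs" can mean in practice, the sketch below computes one common form of distillation loss: the KL divergence between the teacher's and the student's temperature-softened output distributions. The definition above does not name a specific loss, so this formulation is an assumption, as are the names `log_softmax`, `distillation_loss`, and `temperature`, and the logit values, which are made up for the example.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Log-probabilities for a list of logits, softened by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max before exponentiating, for stability
    log_total = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_total for z in scaled]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    Assumed formulation: the result is multiplied by temperature**2, a
    conventional scaling that keeps gradient magnitudes comparable
    across temperatures.
    """
    log_p = log_softmax(teacher_logits, temperature)  # teacher (target)
    log_q = log_softmax(student_logits, temperature)  # student (prediction)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return kl * temperature ** 2


# Hypothetical logits over three classes for a single input.
teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]  # roughly agrees with the teacher
far_student = [0.5, 1.0, 4.0]    # prefers a different class

close_loss = distillation_loss(close_student, teacher)
far_loss = distillation_loss(far_student, teacher)
print(f"close student loss: {close_loss:.4f}")
print(f"far student loss:   {far_loss:.4f}")
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge, so minimizing it over many inputs pushes the student toward the teacher's behavior. In a real training setup this term would be computed by a deep learning framework and is often combined with an ordinary supervised loss on ground-truth labels.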