Question

During model distillation, what is the specific purpose of using the soft labels (probability distributions) produced by the teacher model instead of the hard ground-truth labels for the student model?

Accepted Answer

In model distillation, the primary purpose of using soft labels (the probability distributions produced by the teacher model's output layer) instead of hard ground-truth labels is to pass on the relational information the teacher has learned about how classes resemble one another. A hard label names only the single correct category: in a three-class image classification task with the classes cat, dog, and car, a cat image is labeled 1 for cat and 0 for every other class. A soft label instead gives a full probability distribution, such as 0.9 for cat, 0.09 for dog, and 0.01 for car. This distribution reveals that the teacher considers a cat more similar to a dog than to a car. By training on soft labels, the student model learns these inter-class relationships, often referred to as dark knowledge. Because the student is typically smaller and has less capacity than the teacher, these hints about which classes are similar often help it converge more efficiently and generalize better to new data. Hard labels treat every incorrect class as equally wrong and so carry no information about how the incorrect classes relate to one another, whereas soft labels provide a graded supervisory signal that guides the student to reproduce the teacher's full output distribution rather than only its final classification decision.
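The difference between the two signals can be made concrete with a small sketch based on the cat/dog/car example above. It uses only the Python standard library and assumes a common formulation of the distillation loss: cross-entropy between temperature-softened teacher and student distributions, scaled by the square of the temperature. The temperature value, the student logits, and all function names here are illustrative assumptions, not part of the original answer.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature flattens the distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy of `predicted` against a target distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

def hard_loss(student_logits, label):
    """Loss against a one-hot ground-truth label."""
    one_hot = [1.0 if i == label else 0.0 for i in range(len(student_logits))]
    return cross_entropy(one_hot, softmax(student_logits))

def soft_loss(student_logits, teacher_logits, temperature=2.0):
    """Loss against the teacher's softened distribution (assumed T^2 scaling)."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return cross_entropy(teacher_probs, student_probs) * temperature ** 2

# Classes: index 0 = cat, 1 = dog, 2 = car. These logits reproduce the
# teacher distribution from the answer (0.9, 0.09, 0.01) at temperature 1.
teacher_logits = [math.log(0.9), math.log(0.09), math.log(0.01)]
teacher_probs = softmax(teacher_logits)

# Two hypothetical students that are equally confident about "cat" but
# disagree on which wrong class is the more plausible one.
student_a = [2.0, 0.5, -1.5]   # ranks dog above car, like the teacher
student_b = [2.0, -1.5, 0.5]   # ranks car above dog

hard_a, hard_b = hard_loss(student_a, 0), hard_loss(student_b, 0)
soft_a = soft_loss(student_a, teacher_logits)
soft_b = soft_loss(student_b, teacher_logits)

print("hard-label loss:", hard_a, hard_b)
print("soft-label loss:", soft_a, soft_b)
```

In this sketch the hard-label loss cannot tell the two students apart, because both assign the same probability to cat. The soft-label loss is lower for the student that ranks dog above car, which is the inter-class information the teacher's distribution carries. In practice the two terms are often combined with a weighting factor, which is omitted here for brevity.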

