Toward Understanding Adversarial Distillation: Why Robust Teachers Fail
Hongsin Lee, Hye Won Chung

TL;DR
This paper investigates why adversarial distillation sometimes fails to improve student robustness. It shows that a misalignment between the teacher's confidence and the student's representational limitations on unlearnable data causes robust overfitting.
Contribution
It provides a theoretical analysis of feature learning dynamics showing how teacher confidence on unlearnable samples affects student robustness, and offers a practical indicator for teacher selection.
Findings
Teachers that are confident on unlearnable samples induce overfitting in the student.
Teachers with high uncertainty on those samples suppress noise memorization, improving robustness.
The teacher's predictive entropy on unlearnable samples predicts student robustness.
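To make the last finding concrete, here is a minimal sketch of how a teacher's mean predictive entropy over a subset of training samples could be computed. It assumes the teacher's soft labels are available as probability vectors and that the indices of the unlearnable samples are already known. This summary does not say how the paper identifies that set or how it defines the indicator exactly, so the function names, the toy probabilities, and the index list below are all hypothetical.

```python
import math


def predictive_entropy(probs):
    """Shannon entropy (in nats) of a single probability vector."""
    # Zero-probability classes contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy_on_subset(teacher_probs, subset_indices):
    """Average teacher entropy over the samples listed in subset_indices."""
    if not subset_indices:
        raise ValueError("subset_indices must not be empty")
    total = sum(predictive_entropy(teacher_probs[i]) for i in subset_indices)
    return total / len(subset_indices)


# Toy soft labels for two hypothetical teachers on the same two samples.
confident_teacher = [[0.98, 0.01, 0.01], [0.97, 0.02, 0.01]]
uncertain_teacher = [[0.40, 0.30, 0.30], [0.34, 0.33, 0.33]]

# Assumed to be given: indices of the samples treated as unlearnable.
unlearnable = [0, 1]

confident_score = mean_entropy_on_subset(confident_teacher, unlearnable)
uncertain_score = mean_entropy_on_subset(uncertain_teacher, unlearnable)
```

Under the findings above, the teacher with the higher score on the unlearnable samples would be the one expected to yield the more robust student. The sketch only ranks teachers by entropy and makes no claim about thresholds or about the paper's actual selection procedure.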
Abstract
Adversarial Distillation aims to enhance student robustness by guiding the student with a robust teacher's soft labels within the min-max adversarial training framework, yet its success is notoriously inconsistent: a more robust teacher often fails to improve, or even harms, the student's robust generalization. In this paper, we identify a key mechanism of this teacher dependency: the misalignment between the teacher's supervisory confidence and the student's representational limitations on a consistent subset of training data -- the Robustly Unlearnable Set. We present a theoretical framework analyzing the feature learning dynamics of a two-layer neural network, demonstrating that this mismatch creates a dichotomy in distillation outcomes. We prove that when a teacher provides confident supervision on unlearnable samples, it compels the student to memorize spurious noise patterns that…
