Do Students Debias Like Teachers? On the Distillability of Bias Mitigation Methods
Jiali Cheng, Chirag Agarwal, Hadi Amiri

TL;DR
This paper investigates how knowledge distillation affects the transfer of debiasing capabilities from teacher to student models in NLP and image classification, revealing challenges and proposing solutions to improve debiasing transfer.
Contribution
It is the first comprehensive study analyzing the impact of knowledge distillation on debiasing, identifying internal mechanisms and proposing methods to enhance debiasing transfer.
Findings
Debiasing capability is generally undermined after KD.
Training a debiased model does not benefit from teacher knowledge.
Significant bias-specific variations occur post-distillation.
Abstract
Knowledge distillation (KD) is an effective method for model compression and transferring knowledge between models. However, its effect on a model's robustness against spurious correlations that degrade performance on out-of-distribution data remains underexplored. This study investigates the effect of knowledge distillation on the transferability of "debiasing" capabilities from teacher models to student models on natural language inference (NLI) and image classification tasks. Through extensive experiments, we illustrate several key findings: (i) overall the debiasing capability of a model is undermined post-KD; (ii) training a debiased model does not benefit from injecting teacher knowledge; (iii) although the overall robustness of a model may remain stable post-distillation, significant variations can occur across different types of biases; and (iv) we pinpoint the internal…
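For readers unfamiliar with how teacher knowledge is "injected" into a student, the sketch below shows the widely used soft-target distillation loss (temperature-scaled KL divergence mixed with hard-label cross-entropy). It is an illustration only: the abstract does not state which KD objective the authors use, and the names `kd_loss`, `temperature`, and `alpha` are hypothetical choices made here.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits, with temperature scaling."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, label, temperature=2.0, alpha=0.5):
    """Generic soft-target KD loss for one example (assumed form, not the paper's).

    alpha weights the hard-label cross-entropy; (1 - alpha) weights the
    KL(teacher || student) term computed at the given temperature.
    """
    # Hard-label cross-entropy on the student's unscaled predictions.
    ce = -math.log(softmax(student_logits)[label])
    # Soft-target term: KL divergence between temperature-softened distributions.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return alpha * ce + (1 - alpha) * temperature ** 2 * kl
```

Under this formulation the student is pulled toward the teacher's full output distribution, which is one plausible route by which a teacher's reliance on (or robustness to) spurious features could carry over to the student.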
