Knowledge Distillation with Noisy Labels for Natural Language Understanding
Shivendra Bhardwaj, Abbas Ghaddar, Ahmad Rashid, Khalil Bibi, Chengyang Li, Ali Ghodsi, Philippe Langlais, Mehdi Rezagholizadeh

TL;DR
This paper investigates the effects of noisy labels on Knowledge Distillation in Natural Language Understanding and proposes two methods to mitigate label noise, demonstrating effectiveness on the GLUE benchmark.
Contribution
To the best of the authors' knowledge, it is the first study to analyze and address noisy labels in KD for NLU. It introduces two mitigation techniques and evaluates them on the GLUE benchmark.
Findings
Methods are effective under high noise levels
Label noise significantly impacts KD performance
More research needed for robust solutions
Abstract
Knowledge Distillation (KD) is extensively used to compress and deploy large pre-trained language models on edge devices for real-world applications. However, one neglected area of research is the impact of noisy (corrupted) labels on KD. We present, to the best of our knowledge, the first study on KD with noisy labels in Natural Language Understanding (NLU). We document the scope of the problem and present two methods to mitigate the impact of label noise. Experiments on the GLUE benchmark show that our methods are effective even under high noise levels. Nevertheless, our results indicate that more research is necessary to cope with label noise under KD.
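To make the problem setting concrete, the following is a minimal sketch of how corrupted labels enter a standard KD objective. It is not the paper's method: the two mitigation techniques are not described in this summary and are not reproduced here. The sketch assumes the common formulation that mixes a hard-label cross-entropy term with a temperature-softened teacher term, and a symmetric noise model in which a label is flipped to a different class uniformly at random. The names `corrupt_labels`, `kd_loss`, `alpha`, `temperature`, and `noise_rate` are illustrative assumptions.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def corrupt_labels(labels, num_classes, noise_rate, rng):
    """Symmetric label noise (assumed noise model, num_classes >= 2).

    Each label is replaced, with probability `noise_rate`, by a different
    class chosen uniformly at random.
    """
    labels = np.asarray(labels)
    flip = rng.random(labels.shape[0]) < noise_rate
    # An offset in [1, num_classes - 1] guarantees the new class differs.
    offsets = rng.integers(1, num_classes, size=labels.shape[0])
    return np.where(flip, (labels + offsets) % num_classes, labels)


def kd_loss(student_logits, teacher_logits, labels, alpha=0.5, temperature=2.0):
    """Standard KD objective (assumed form, not taken from the paper).

    alpha * CE(student, labels)
      + (1 - alpha) * T^2 * KL(teacher_T || student_T)
    """
    n = student_logits.shape[0]
    eps = 1e-12
    # Hard-label term: this is where noisy labels affect the student directly.
    p_student = softmax(student_logits)
    ce = -np.log(p_student[np.arange(n), labels] + eps).mean()
    # Soft-label term: matches the student to the temperature-softened teacher.
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = (p_t * (np.log(p_t + eps) - np.log(p_s + eps))).sum(axis=-1).mean()
    return alpha * ce + (1.0 - alpha) * temperature ** 2 * kl


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = rng.integers(0, 3, size=8)
    noisy = corrupt_labels(clean, num_classes=3, noise_rate=0.4, rng=rng)
    logits = rng.normal(size=(8, 3))
    print("clean-label loss:", kd_loss(logits, logits, clean))
    print("noisy-label loss:", kd_loss(logits, logits, noisy))
```

In this sketch, noise reaches the student only through the hard-label term. In practice a teacher fine-tuned on the same corrupted data would also carry the noise into its soft targets, which is one reason the setting is harder than this toy example suggests.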
Taxonomy
Topics: Machine Learning and Data Classification · Machine Learning and Algorithms · Advanced Neural Network Applications
