Balancing Knowledge Distillation for Imbalance Learning with Bilevel Optimization
Anh B.H. Nguyen, Ba Tho Phan, and Viet Cuong Ta

TL;DR
This paper introduces BiKD, a bilevel optimization framework that dynamically balances hard and soft knowledge distillation losses on a per-sample basis, improving learning on imbalanced datasets.
Contribution
The paper proposes a novel bilevel framework with a weight generation network for adaptive sample-wise loss balancing in knowledge distillation.
Findings
BiKD outperforms recent balanced distillation methods on long-tailed CIFAR-10/100.
The adaptive weighting improves model performance across various imbalance factors.
Multi-step SGD enhances the efficiency of training the weight generation network.
Abstract
Knowledge distillation transfers knowledge from a high-capacity teacher to a compact student using a mixture of hard and soft losses. On imbalanced data, a fixed weighting between hard and soft losses makes the learning process brittle. Recent studies try to reweight these components in long-tailed settings. However, most of these methods do not adapt weights at the sample level and do not take the student's behavior during training into account. To address this, we propose BiKD, a bilevel framework that dynamically balances hard and soft losses for each sample. We employ a weight generation network that produces adaptive per-sample weights, guided by a small balanced validation set. The student is then trained with an unconstrained combination of weighted hard and soft losses, allowing the student to relax both terms. We further propose a multi-step SGD strategy to optimize the…
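The per-sample combination of hard and soft losses described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the hard loss is cross-entropy against the label and the soft loss is a temperature-scaled KL divergence from teacher to student, which are the usual choices in knowledge distillation. The function names, the temperature value, and the weight lists `hard_w` and `soft_w` are hypothetical; in BiKD the weights would come from the weight generation network, which is not modeled here.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def hard_loss(student_logits, label):
    """Cross-entropy between the student prediction and the true label."""
    probs = softmax(student_logits)
    return -math.log(probs[label])

def soft_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions (assumed form)."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    return sum(t * math.log(t / s) for t, s in zip(p_t, p_s) if t > 0)

def weighted_kd_loss(student_logits, teacher_logits, labels,
                     hard_w, soft_w, temperature=2.0):
    """Mean over the batch of hard_w[i] * hard_i + soft_w[i] * soft_i.

    The two weights are independent per sample and are NOT forced to sum
    to 1, mirroring the "unconstrained combination" in the abstract.
    """
    total = 0.0
    for s, t, y, a, b in zip(student_logits, teacher_logits, labels, hard_w, soft_w):
        total += a * hard_loss(s, y) + b * soft_loss(s, t, temperature)
    return total / len(labels)

# Toy batch: two samples, three classes; the weights are made-up values.
student = [[2.0, 0.5, 0.1], [0.2, 0.3, 1.5]]
teacher = [[3.0, 0.2, 0.1], [0.1, 2.0, 0.4]]
labels = [0, 1]
loss = weighted_kd_loss(student, teacher, labels, hard_w=[1.0, 0.3], soft_w=[0.5, 1.2])
```

A fixed-weight baseline corresponds to passing the same pair of weights for every sample, for example `hard_w=[0.5, 0.5]` and `soft_w=[0.5, 0.5]`.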
