Balancing Knowledge Distillation for Imbalance Learning with Bilevel Optimization

Anh B.H. Nguyen, Ba Tho Phan, and Viet Cuong Ta

arXiv:2605.17839·cs.LG·May 20, 2026

TL;DR

This paper introduces BiKD, a bilevel optimization framework that dynamically balances hard and soft knowledge distillation losses on a per-sample basis, improving learning on imbalanced datasets.

Contribution

The paper proposes a novel bilevel framework with a weight generation network for adaptive sample-wise loss balancing in knowledge distillation.

Findings

01. BiKD outperforms recent balanced distillation methods on long-tailed CIFAR-10/100.

02. The adaptive weighting improves model performance across various imbalance factors.

03. Multi-step SGD enhances the efficiency of training the weight generation network.

Abstract

Knowledge distillation transfers knowledge from a high-capacity teacher to a compact student using a mixture of hard and soft losses. On imbalanced data, a fixed weighting between the hard and soft losses makes the learning process brittle. Recent studies try to reweight these components in long-tailed settings. However, most of these methods do not adapt the weights at the sample level and do not take the student's behavior during training into account. To address this, we propose BiKD, a bilevel framework that dynamically balances the hard and soft losses for each sample. We employ a weight generation network that produces adaptive per-sample weights, guided by a small balanced validation set. The student is then trained with an unconstrained combination of weighted hard and soft losses, allowing the student to relax both terms. We further propose a multi-step SGD strategy to optimize the…
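To make the per-sample weighting concrete, the following is a minimal sketch of a distillation loss in which each sample carries its own pair of weights for the hard and soft terms. It is not the authors' implementation: the function names, the temperature value, and the hand-picked weights are all assumptions for illustration. In BiKD the weights would come from the weight generation network, which is not modeled here, and the bilevel update on the balanced validation set is omitted entirely.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def hard_loss(student_logits, label):
    """Cross-entropy between the student prediction and the true label."""
    probs = softmax(student_logits)
    return -math.log(probs[label])


def soft_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return sum(
        t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0
    )


def weighted_kd_loss(batch, weights, temperature=2.0):
    """Mean of w_hard * hard + w_soft * soft over the batch.

    batch:   list of (student_logits, teacher_logits, label)
    weights: list of (w_hard, w_soft), one pair per sample. The pairs are
             hypothetical stand-ins for the weight network's output and
             are not forced to sum to 1 (the "unconstrained" combination).
    """
    total = 0.0
    for (s_logits, t_logits, label), (w_hard, w_soft) in zip(batch, weights):
        total += w_hard * hard_loss(s_logits, label)
        total += w_soft * soft_loss(s_logits, t_logits, temperature)
    return total / len(batch)


# Illustrative usage with made-up logits and weights.
batch = [
    ([2.0, 0.5, -1.0], [3.0, 0.0, -2.0], 0),  # e.g. a head-class sample
    ([0.1, 0.2, 0.3], [-1.0, 0.5, 2.5], 2),   # e.g. a tail-class sample
]
weights = [(0.3, 1.0), (1.5, 0.4)]
loss = weighted_kd_loss(batch, weights)
```

A fixed-weight baseline corresponds to passing the same pair, such as `(alpha, 1 - alpha)`, for every sample. Many distillation implementations also scale the soft term by the squared temperature; that factor is left out here to keep the sketch short.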
