Bayes Conditional Distribution Estimation for Knowledge Distillation Based on Conditional Mutual Information
Linfeng Ye, Shayan Mohajer Hamidi, Renhao Tan, En-Hui Yang

TL;DR
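This paper introduces the maximum conditional mutual information (MCMI) estimator for knowledge distillation (KD). The teacher is trained to maximize both its log-likelihood and its conditional mutual information (CMI), which gives the student a better estimate of the Bayes conditional probability distribution (BCPD) and improves student performance, especially in zero-shot and few-shot settings.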
Contribution
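The paper proposes the maximum CMI (MCMI) method, which improves the teacher's estimate of the Bayes conditional probability distribution (BCPD) by maximizing conditional mutual information (CMI) alongside log-likelihood during teacher training, making KD more effective.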
Findings
Student accuracy improved by up to 3.32% with MCMI teachers.
Significant gains in zero-shot and few-shot settings, up to 84% accuracy.
Eigen-CAM visualizations show that teachers trained with MCMI capture more contextual information in an image cluster.
Abstract
It is believed that in knowledge distillation (KD), the role of the teacher is to provide an estimate for the unknown Bayes conditional probability distribution (BCPD) to be used in the student training process. Conventionally, this estimate is obtained by training the teacher using maximum log-likelihood (MLL) method. To improve this estimate for KD, in this paper we introduce the concept of conditional mutual information (CMI) into the estimation of BCPD and propose a novel estimator called the maximum CMI (MCMI) method. Specifically, in MCMI estimation, both the log-likelihood and CMI of the teacher are simultaneously maximized when the teacher is trained. Through Eigen-CAM, it is further shown that maximizing the teacher's CMI value allows the teacher to capture more contextual information in an image cluster. Via conducting a thorough set of experiments, we show that by employing a…
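The joint objective described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation. It assumes the CMI I(X; Ŷ | Y) is estimated empirically as the average KL divergence between each teacher output distribution and the mean output distribution of its ground-truth class, and that the two terms are combined with a trade-off weight. The function names `empirical_cmi` and `mcmi_objective` and the parameter `lam` are hypothetical and do not come from the source.

```python
import numpy as np


def empirical_cmi(probs, labels, eps=1e-12):
    """Empirical estimate of I(X; Y_hat | Y) from teacher outputs.

    probs:  (n, C) array, each row a teacher output distribution P_x.
    labels: (n,) integer array of ground-truth classes.
    Assumed form: the sample average of KL(P_x || Q_y), where Q_y is the
    mean output distribution over the samples whose label is y.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for y in np.unique(labels):
        p = probs[labels == y]                 # outputs for class y
        q = p.mean(axis=0, keepdims=True)      # class centroid Q_y
        kl = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=1)
        total += kl.sum()
    return total / len(labels)


def mcmi_objective(probs, labels, lam=0.1, eps=1e-12):
    """Quantity to maximize: mean log-likelihood + lam * CMI.

    lam is a hypothetical trade-off weight chosen for illustration.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    true_class_probs = probs[np.arange(len(labels)), labels]
    log_likelihood = np.mean(np.log(true_class_probs + eps))
    return log_likelihood + lam * empirical_cmi(probs, labels, eps)


if __name__ == "__main__":
    labels = np.array([0, 0, 1, 1])
    # Outputs that vary within each class carry information about X beyond Y.
    probs = np.array([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.4, 0.6]])
    print("CMI estimate:", empirical_cmi(probs, labels))
    print("Objective:   ", mcmi_objective(probs, labels, lam=0.1))
```

Under this assumed estimator, a teacher whose outputs are identical for every sample of a class has a CMI of zero, while a teacher whose outputs vary from sample to sample within a class has a positive CMI. In an actual training loop the same quantity would be computed per batch in a framework with automatic differentiation, and its negative would be minimized.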
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · COVID-19 diagnosis using AI · Machine Learning and Algorithms
Methods: Sparse Evolutionary Training · Knowledge Distillation
