PHI-S: Distribution Balancing for Label-Free Multi-Teacher Distillation
Mike Ranzinger, Jon Barker, Greg Heinrich, Pavlo Molchanov, Bryan Catanzaro, Andrew Tao

TL;DR
This paper introduces PHI-S, a novel distribution balancing technique using Hadamard matrices for label-free multi-teacher distillation, improving student model quality by standardizing activation statistics.
Contribution
The paper proposes PHI Standardization (PHI-S), a new method employing Hadamard matrices for isotropic distribution alignment in multi-teacher distillation without labels.
Findings
Among the normalization techniques studied, PHI-S yields the best student model quality.
Hadamard matrices enable effective isotropic standardization of activation distributions.
Distribution balancing improves downstream teacher-matching metrics.
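Why balancing matters can be illustrated with a toy calculation. The sketch below is not the paper's code: the two "teachers" are synthetic Gaussian features, and `mean_baseline_mse` and `global_standardize` are hypothetical helper names. It assumes a plain MSE feature-matching loss and uses a single-scalar standardization as a stand-in for the normalization techniques the paper compares.

```python
import numpy as np


def mean_baseline_mse(features: np.ndarray) -> float:
    """MSE of a student that predicts only the per-dimension mean.

    This equals the average per-dimension variance of the teacher features,
    so it indicates how much each teacher contributes to an unweighted MSE loss.
    """
    centered = features - features.mean(axis=0)
    return float(np.mean(centered ** 2))


def global_standardize(features: np.ndarray) -> np.ndarray:
    """Center the features, then divide by one scalar shared by all dimensions."""
    centered = features - features.mean(axis=0)
    scale = np.sqrt(np.mean(centered ** 2))
    return centered / scale


# Synthetic stand-ins for two teachers with very different activation scales.
rng = np.random.default_rng(0)
teacher_a = rng.normal(scale=0.1, size=(4096, 8))
teacher_b = rng.normal(scale=10.0, size=(4096, 8))

raw = [mean_baseline_mse(t) for t in (teacher_a, teacher_b)]
balanced = [mean_baseline_mse(global_standardize(t)) for t in (teacher_a, teacher_b)]

print("raw loss scale per teacher:     ", raw)
print("balanced loss scale per teacher:", balanced)
```

With raw features, the high-variance teacher dominates the summed loss by several orders of magnitude. After standardization, both teachers contribute on the same scale.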
Abstract
Various visual foundation models have distinct strengths and weaknesses, both of which can be improved through heterogeneous multi-teacher knowledge distillation without labels, termed "agglomerative models." We build upon this body of work by studying the effect of the teachers' activation statistics, particularly the impact of the loss function on the resulting student model quality. We explore a standard toolkit of statistical normalization techniques to better align the different distributions and assess their effects. Further, we examine the impact on downstream teacher-matching metrics, which motivates the use of Hadamard matrices. With these matrices, we demonstrate useful properties, showing how they can be used for isotropic standardization, where each dimension of a multivariate distribution is standardized using the same scale. We call this technique "PHI Standardization"…
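The isotropic standardization described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the rotation is built from the eigenvectors of the feature covariance followed by a normalized Hadamard matrix, it uses the Sylvester construction (so the feature dimension must be a power of two), and the names `fit_phi_s`, `apply_phi_s`, and `invert_phi_s` are hypothetical.

```python
import numpy as np


def sylvester_hadamard(n: int) -> np.ndarray:
    """Return an n x n Hadamard matrix with +/-1 entries (n must be a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("Sylvester construction needs n to be a power of two")
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h


def fit_phi_s(features: np.ndarray):
    """Estimate a mean, a rotation, and ONE scalar that standardize every dimension."""
    mu = features.mean(axis=0)
    cov = np.cov(features, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # cov = U diag(eigvals) U^T
    eigvals = np.clip(eigvals, 0.0, None)   # guard against tiny negative values
    dim = features.shape[1]
    hadamard = sylvester_hadamard(dim) / np.sqrt(dim)  # orthonormal rows
    # After U^T the variances are the eigenvalues. Each Hadamard row then mixes
    # them with equal weight 1/dim, so every output dimension has variance
    # equal to the mean eigenvalue.
    rotation = hadamard @ eigvecs.T
    alpha = 1.0 / np.sqrt(eigvals.mean())
    return mu, rotation, alpha


def apply_phi_s(features, mu, rotation, alpha):
    """Map raw teacher features into the standardized space."""
    return alpha * (features - mu) @ rotation.T


def invert_phi_s(standardized, mu, rotation, alpha):
    """Map standardized features back to the teacher's original space."""
    return (standardized @ rotation) / alpha + mu
```

Because the rotation is orthogonal and the scale is a single scalar, the map is exactly invertible, so student predictions made in the standardized space can be mapped back to the teacher's original feature space. For example, calling `fit_phi_s` on a feature matrix and then `apply_phi_s` gives outputs whose per-dimension sample variance is 1 up to floating-point error.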
Code & Models
- 🤗 nvidia/C-RADIOv4-SO400M · model · 4.7k dl · ♡ 31
- 🤗 nvidia/RADIO · model · 340 dl · ♡ 43
- 🤗 nvidia/C-RADIO · model · 4.8k dl · ♡ 26
- 🤗 nvidia/RADIO-B · model · 33 dl · ♡ 3
- 🤗 nvidia/RADIO-L · model · 24k dl · ♡ 10
- 🤗 nvidia/RADIO-H · model · 89 dl · ♡ 10
- 🤗 nvidia/C-RADIOv2-H · model · 4.9k dl · ♡ 11
- 🤗 nvidia/C-RADIOv2-B · model · 315 dl · ♡ 10
- 🤗 nvidia/C-RADIOv2-L · model · 25 dl · ♡ 3
- 🤗 nvidia/C-RADIOv2-g · model · 13 dl · ♡ 12
Taxonomy
Topics: Process Optimization and Integration
Methods: Knowledge Distillation · ALIGN
