AME: Aligned Manifold Entropy for Robust Vision-Language Distillation
Guiming Cao, Yuming Ou

TL;DR
This paper introduces AME, a novel entropy-based method for robust vision-language knowledge distillation that performs well even with limited data, without modifying model architecture.
Contribution
It proposes a plug-and-play entropy minimization technique over a shared manifold to improve generalization in vision-language distillation under low-data conditions.
Findings
Consistently improves distillation robustness across architectures
Achieves better generalization on downstream tasks
Theoretically tightens generalization error bounds
Abstract
Knowledge distillation is a long-established technique for knowledge transfer, and has regained attention in the context of the recent emergence of large vision-language models (VLMs). However, vision-language knowledge distillation often requires sufficient training data to achieve robust generalization on samples with ambiguous or boundary-adjacent representations, which are associated with high predictive uncertainty. Critically, collecting such large-scale, task-specific data for training is often impractical in real-world scenarios. To address this major challenge arising from the entanglement of uncertainty and cross-modal feature representation, we propose Aligned Manifold Entropy for Robust Vision-Language Distillation (AME), aiming to achieve robust generalization under real-world conditions. AME applies entropy minimization over a reconfigured shared manifold, where multi-modal…
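To make the idea of entropy minimization in distillation concrete, here is a minimal sketch in plain Python. It is not the AME objective: the abstract above is truncated and does not specify how the shared manifold is built, so this only shows a generic distillation loss (KL divergence from teacher to student) with an added penalty on the Shannon entropy of the student's predictions. The function names and the `entropy_weight` parameter are assumptions introduced for illustration.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; high values indicate an uncertain prediction."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q), with q clamped at eps to avoid division by zero."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)

def distill_loss_with_entropy(teacher_logits, student_logits, entropy_weight=0.1):
    """Hypothetical combined loss: match the teacher, and penalize uncertain student outputs.

    entropy_weight is an illustrative hyperparameter, not a value from the paper.
    """
    teacher = softmax(teacher_logits)
    student = softmax(student_logits)
    return kl_divergence(teacher, student) + entropy_weight * entropy(student)
```

A sample near a decision boundary yields a flat student distribution and therefore a larger entropy term, so minimizing this loss pushes the student toward confident predictions while still tracking the teacher.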
