TL;DR
Switch-KD is a visual-switch distillation framework that unifies vision-language knowledge transfer within a shared text-probability space, improving the performance of small multimodal models.
Contribution
The paper proposes Switch-KD, a new method for multimodal knowledge distillation that explicitly aligns visual and language modalities in a shared probabilistic space.
Findings
The distilled TinyLLaVA improves by an average of 3.6 points across 10 benchmarks.
Switch-KD effectively transfers multimodal knowledge from a 3B teacher to a 0.5B student.
The method improves model performance without architectural modifications.
Abstract
Vision-Language Models (VLMs) have shown remarkable capabilities in joint vision-language understanding, but their large scale poses significant challenges for deployment in resource-constrained scenarios. Knowledge Distillation (KD) offers a viable way to improve model capabilities without increasing model size or data requirements, making deployment more efficient. However, applying KD to VLMs is challenged by modality-specific supervision: although multimodal knowledge in VLMs is fused within the language space, current methods supervise each modality separately without explicitly addressing multimodal alignment, leading to inconsistent multimodal knowledge transfer. To address this, we propose Switch-KD, a visual-switch distillation framework that unifies vision-language knowledge transfer within a shared text-probability space. Switch-KD comprises two key components: (1)…
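For readers new to distillation in a probability space, here is a minimal sketch of generic temperature-scaled logit distillation, where a student is trained to match a teacher's next-token distributions. It is not the Switch-KD objective: the abstract above is cut off before its components are described, so none of them are reproduced here. The function names (`log_softmax`, `kd_loss`), the temperature value, and the toy logits are all illustrative assumptions.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of one logit vector at a given temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_sum_exp for x in scaled]


def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """Mean KL(teacher || student) over token positions, scaled by T^2.

    Both arguments are lists of per-position logit vectors over the same
    vocabulary. This is the generic soft-label distillation loss, used here
    only as an assumed stand-in for "matching text probabilities".
    """
    if len(teacher_logits) != len(student_logits):
        raise ValueError("teacher and student must cover the same positions")
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        log_p = log_softmax(t_row, temperature)  # teacher log-probabilities
        log_q = log_softmax(s_row, temperature)  # student log-probabilities
        total += sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return (temperature ** 2) * total / len(teacher_logits)


# Toy example: two token positions, vocabulary of three tokens (made-up values).
teacher = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
student = [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(kd_loss(teacher, student))
```

The loss is zero when the student reproduces the teacher's distributions exactly and grows as they diverge. The T^2 factor is the common convention for keeping gradient magnitudes comparable across temperatures.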
