EM-KD: Distilling Efficient Multimodal Large Language Model with Unbalanced Vision Tokens
Ze Feng, Sen Yang, Boqiang Duan, Wankou Yang, Jingdong Wang

TL;DR
EM-KD is a knowledge distillation approach for efficient multimodal large language models that aligns the unbalanced vision tokens of student and teacher, improving both accuracy and efficiency.
Contribution
The paper proposes EM-KD, a distillation framework that handles the mismatch in vision-token counts between the efficient student and the vanilla teacher with Hungarian matching, and then applies two distillation strategies to the aligned tokens.
Findings
Outperforms prior efficient MLLMs in accuracy and efficiency.
Effective alignment of vision tokens improves knowledge transfer.
Achieves better results than previous distillation methods.
Abstract
Efficient Multimodal Large Language Models (MLLMs) compress vision tokens to reduce resource consumption, but the loss of visual information can degrade comprehension capabilities. Although some prior works introduce Knowledge Distillation to enhance student models, they overlook the fundamental differences in fine-grained vision comprehension caused by unbalanced vision tokens between the efficient student and the vanilla teacher. In this paper, we propose EM-KD, a novel paradigm that enhances Efficient MLLMs with Knowledge Distillation. To overcome the challenge of unbalanced vision tokens, we first calculate the Manhattan distance between the vision logits of the teacher and the student, and then align them in the spatial dimension with the Hungarian matching algorithm. After alignment, EM-KD introduces two distillation strategies: 1) Vision-Language Affinity Distillation (VLAD) and 2) Vision…
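The alignment step described in the abstract (Manhattan distance between vision logits, followed by Hungarian matching) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy logits, and the assumption that each student token is matched one-to-one to a distinct teacher token (with the student having no more tokens than the teacher) are all hypothetical choices made for illustration.

```python
from typing import List

Matrix = List[List[float]]


def manhattan_cost(student: Matrix, teacher: Matrix) -> Matrix:
    """cost[i][j] = L1 distance between student token i and teacher token j."""
    return [[sum(abs(a - b) for a, b in zip(s, t)) for t in teacher]
            for s in student]


def hungarian_match(cost: Matrix) -> List[int]:
    """Minimum-cost assignment for an n x m cost matrix with n <= m.

    Returns match, where match[i] is the column assigned to row i.
    Standard potentials-based Hungarian algorithm, O(n^2 * m).
    """
    n, m = len(cost), len(cost[0])
    assert n <= m, "assumes no more student tokens than teacher tokens"
    inf = float("inf")
    u = [0.0] * (n + 1)   # row potentials (1-indexed)
    v = [0.0] * (m + 1)   # column potentials (1-indexed)
    p = [0] * (m + 1)     # p[j] = row currently assigned to column j
    way = [0] * (m + 1)   # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:  # reached a free column: augment
                break
        while j0 != 0:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    match = [-1] * n
    for j in range(1, m + 1):
        if p[j] != 0:
            match[p[j] - 1] = j - 1
    return match


def align_tokens(student_logits: Matrix, teacher_logits: Matrix) -> List[int]:
    """Hypothetical helper: teacher token index matched to each student token."""
    return hungarian_match(manhattan_cost(student_logits, teacher_logits))


# Toy data (made up): 4 teacher tokens, 2 compressed student tokens.
teacher_logits = [[0.0, 1.0], [5.0, 5.0], [2.0, 0.0], [9.0, 9.0]]
student_logits = [[2.1, 0.1], [0.2, 0.9]]
matching = align_tokens(student_logits, teacher_logits)
print(matching)  # [2, 0]
```

In this toy case, student token 0 is paired with teacher token 2 and student token 1 with teacher token 0, leaving the other teacher tokens unmatched. With NumPy and SciPy available, `scipy.spatial.distance.cdist(..., metric="cityblock")` and `scipy.optimize.linear_sum_assignment` would compute the same cost matrix and assignment.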
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
