EM-KD: Distilling Efficient Multimodal Large Language Model with Unbalanced Vision Tokens

Ze Feng; Sen Yang; Boqiang Duan; Wankou Yang; Jingdong Wang

arXiv:2511.21106·cs.CV·November 27, 2025

TL;DR

EM-KD is a knowledge distillation approach for efficient multimodal large language models (MLLMs). It aligns the unbalanced vision tokens of teacher and student before distilling, and is reported to improve both accuracy and efficiency.

Contribution

The paper proposes EM-KD, a distillation framework that handles the mismatch in vision token counts between teacher and student by aligning their vision logits with Hungarian matching, and then applies two distillation strategies to the aligned tokens.

Findings

1. Outperforms prior efficient MLLMs in accuracy and efficiency.

2. Effective alignment of vision tokens improves knowledge transfer.

3. Achieves better results than previous distillation methods.

Abstract

Efficient Multimodal Large Language Models (MLLMs) compress vision tokens to reduce resource consumption, but the resulting loss of visual information can degrade comprehension. Although some prior works use Knowledge Distillation to enhance student models, they overlook the fundamental differences in fine-grained vision comprehension caused by unbalanced vision tokens between the efficient student and the vanilla teacher. In this paper, we propose EM-KD, a novel paradigm that enhances efficient MLLMs with Knowledge Distillation. To overcome the challenge of unbalanced vision tokens, we first calculate the Manhattan distance between the vision logits of the teacher and the student, and then align them in the spatial dimension with the Hungarian matching algorithm. After alignment, EM-KD introduces two distillation strategies: 1) Vision-Language Affinity Distillation (VLAD) and 2) Vision…
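The alignment step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes the student has no more vision tokens than the teacher, that each student token is matched to one distinct teacher token, and that logits are plain lists of floats. The function names `manhattan`, `hungarian`, and `match_student_to_teacher` are hypothetical. In practice a library routine such as `scipy.optimize.linear_sum_assignment` would likely replace the hand-written solver.

```python
# Illustrative sketch only; names and matching direction are assumptions.

def manhattan(a, b):
    """L1 (Manhattan) distance between two logit vectors of equal length."""
    return sum(abs(x - y) for x, y in zip(a, b))


def hungarian(cost):
    """Minimum-cost assignment for an n x m cost matrix with n <= m.

    Returns a list where entry i is the column assigned to row i.
    Standard potentials-based Hungarian algorithm, O(n^2 * m).
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u = [0.0] * (n + 1)   # row potentials (1-indexed)
    v = [0.0] * (m + 1)   # column potentials (1-indexed)
    p = [0] * (m + 1)     # p[j] = row currently matched to column j (0 = free)
    way = [0] * (m + 1)   # previous column on the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        # Flip the matching along the augmenting path back to column 0.
        while j0 != 0:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [-1] * n
    for j in range(1, m + 1):
        if p[j] != 0:
            assignment[p[j] - 1] = j - 1
    return assignment


def match_student_to_teacher(student_logits, teacher_logits):
    """For each student token, return the index of its matched teacher token."""
    cost = [[manhattan(s, t) for t in teacher_logits] for s in student_logits]
    return hungarian(cost)


# Toy example: 2 student tokens, 3 teacher tokens, 2-dimensional logits.
student = [[0.0, 0.0], [5.0, 5.0]]
teacher = [[5.0, 4.0], [0.0, 1.0], [9.0, 9.0]]
matches = match_student_to_teacher(student, teacher)
```

In the toy example, `matches[i]` is the teacher token paired with student token `i`; the unmatched teacher token is simply left out of the pairing. How EM-KD treats unmatched teacher tokens is not stated in the text above.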

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications