MoVE-KD: Knowledge Distillation for VLMs with Mixture of Visual Encoders

Jiajun Cao; Yuan Zhang; Tao Huang; Ming Lu; Qizhe Zhang; Ruichuan An; Ningning MA; Shanghang Zhang

arXiv:2501.01709 · cs.CV · March 19, 2025

TL;DR

MoVE-KD is a knowledge distillation framework that distills multiple visual encoders into a single, efficient encoder. It uses input-dependent activation of specialized experts and attention-based distillation to improve vision-language model performance.

Contribution

The paper proposes MoVE-KD, a new method that distills multiple visual encoders into one using low-rank adaptation, mixture-of-experts, and attention-based distillation strategies.
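This page does not spell out the attention-based distillation objective, so the following is only a minimal sketch of one plausible reading: each teacher's per-token feature error is weighted by a token-importance score before being summed. The function name, the use of a softmax over `token_scores`, and the fixed `teacher_weights` are all assumptions made for illustration, not the paper's formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_weighted_kd_loss(student, teachers, token_scores, teacher_weights=None):
    """Hypothetical token-weighted feature distillation loss.

    student:         (N, D) student token features
    teachers:        list of (N, D) teacher token features, assumed already
                     projected to the student's dimension D
    token_scores:    (N,) importance score per token (for example, attention
                     from a class token); this choice is an assumption
    teacher_weights: optional (T,) weights per teacher; uniform if omitted
    """
    token_w = softmax(token_scores)  # (N,), sums to 1
    if teacher_weights is None:
        teacher_weights = np.full(len(teachers), 1.0 / len(teachers))
    loss = 0.0
    for w_t, teacher in zip(teacher_weights, teachers):
        per_token = ((student - teacher) ** 2).mean(axis=1)  # (N,) MSE per token
        loss += w_t * float(per_token @ token_w)             # importance-weighted sum
    return loss
```

With uniform `token_scores` this reduces to ordinary mean-squared-error feature distillation; non-uniform scores shift the loss toward the tokens marked as important.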

Findings

01. Consolidates the capabilities of multiple teacher encoders into a single student encoder.

02. Improves the performance of VLMs such as LLaVA and LLaVA-NeXT.

03. Reduces computational cost relative to running multiple encoders inside one VLM, while maintaining high accuracy.

Abstract

Visual encoders are fundamental components in vision-language models (VLMs), each showcasing unique strengths derived from various pre-trained visual foundation models. To leverage the various capabilities of these encoders, recent studies incorporate multiple encoders within a single VLM, leading to a considerable increase in computational cost. In this paper, we present Mixture-of-Visual-Encoder Knowledge Distillation (MoVE-KD), a novel framework that distills the unique proficiencies of multiple vision encoders into a single, efficient encoder model. Specifically, to mitigate conflicts and retain the unique characteristics of each teacher encoder, we employ low-rank adaptation (LoRA) and mixture-of-experts (MoEs) to selectively activate specialized knowledge based on input features, enhancing both adaptability and efficiency. To regularize the KD process and enhance performance, we…
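The combination of LoRA and mixture-of-experts described in the abstract can be illustrated with a small sketch. The layer below keeps a frozen base weight, attaches several low-rank adapters as experts, and lets a router pick the top-scoring expert per token. The class name `MoLoRALinear`, the top-k routing rule, and all sizes are hypothetical choices for illustration; the paper's actual layer design is not given on this page.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; entries of -inf receive weight 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class MoLoRALinear:
    """Hypothetical linear layer with a mixture of LoRA experts."""

    def __init__(self, d_in, d_out, num_experts=3, rank=4, top_k=1, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.02, size=(d_in, d_out))          # frozen base weight
        self.A = rng.normal(scale=0.02, size=(num_experts, d_in, rank))
        self.B = np.zeros((num_experts, rank, d_out))                # zero init: adapters start as a no-op
        self.router = rng.normal(scale=0.02, size=(d_in, num_experts))
        self.top_k = top_k

    def __call__(self, x):
        # x: (N, d_in) token features
        logits = x @ self.router                                     # (N, E) routing scores
        top_idx = np.argsort(logits, axis=-1)[:, -self.top_k:]       # indices of the top-k experts
        mask = np.full_like(logits, -np.inf)
        np.put_along_axis(mask, top_idx, 0.0, axis=-1)               # unmask only selected experts
        gates = softmax(logits + mask)                               # (N, E), zero for inactive experts
        # Low-rank update from every expert: x @ A_e @ B_e -> (E, N, d_out)
        delta = np.einsum('nd,edr,ero->eno', x, self.A, self.B)
        out = x @ self.W + np.einsum('ne,eno->no', gates, delta)     # gate-weighted sum of updates
        return out, gates
```

For clarity this sketch computes every expert's update and then zeroes out the inactive ones through the gates; an efficient implementation would evaluate only the selected experts for each token.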

Taxonomy

Topics: Neural Networks and Reservoir Computing

Methods: Knowledge Distillation