LLaVA-KD: A Framework of Distilling Multimodal Large Language Models

Yuxuan Cai; Jiangning Zhang; Haoyang He; Xinwei He; Ao Tong; Zhenye Gan; Chengjie Wang; Zhucun Xue; Yong Liu; Xiang Bai

arXiv:2410.16236 · cs.CV · July 4, 2025

TL;DR

LLaVA-KD introduces a knowledge distillation framework that effectively transfers capabilities from large-scale multimodal models to smaller models, enhancing their performance in vision-language understanding without changing their architecture.

Contribution

The paper proposes a three-stage distillation framework that combines Multimodal Distillation (MDist) and Relation Distillation (RDist) to improve small-scale multimodal models using knowledge from large-scale models.

Findings

01. Significant performance improvements in small MLLMs through distillation.

02. Effective transfer of visual and linguistic representations from large to small models.

03. Validation of each component's contribution via extensive experiments.

Abstract

The success of Large Language Models (LLMs) has inspired the development of Multimodal Large Language Models (MLLMs) for unified understanding of vision and language. However, the increasing model size and computational complexity of large-scale MLLMs (l-MLLMs) limit their use in resource-constrained scenarios. Although small-scale MLLMs (s-MLLMs) are designed to reduce computational costs, they typically suffer from performance degradation. To mitigate this limitation, we propose a novel LLaVA-KD framework to transfer knowledge from l-MLLMs to s-MLLMs. Specifically, we introduce Multimodal Distillation (MDist) to transfer the teacher model's robust representations across both visual and linguistic modalities, and Relation Distillation (RDist) to transfer the teacher model's ability to capture visual token relationships. Additionally, we propose a three-stage training scheme to fully exploit…
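To make the two distillation ideas in the abstract concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the truncated abstract does not give the loss definitions, so this sketch assumes a per-token KL divergence between teacher and student output distributions as a stand-in for MDist, and a mean squared difference between token-to-token cosine-similarity matrices as a stand-in for RDist. All function names and the `temperature` parameter are hypothetical, and a real training setup would use PyTorch tensors rather than lists.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over one list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def response_distill_loss(teacher_logits, student_logits, temperature=1.0):
    """Assumed MDist-style loss: mean per-token KL(teacher || student).

    Both arguments are lists of per-token logit lists over a shared vocabulary.
    """
    losses = [
        kl_divergence(softmax(t, temperature), softmax(s, temperature))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(losses) / len(losses)


def cosine_similarity_matrix(tokens):
    """N x N matrix of cosine similarities between N token embeddings."""
    # Fall back to 1.0 for an all-zero embedding to avoid division by zero.
    norms = [math.sqrt(sum(x * x for x in tok)) or 1.0 for tok in tokens]
    return [
        [sum(a * b for a, b in zip(ti, tj)) / (ni * nj)
         for tj, nj in zip(tokens, norms)]
        for ti, ni in zip(tokens, norms)
    ]


def relation_distill_loss(teacher_tokens, student_tokens):
    """Assumed RDist-style loss: mean squared gap between similarity matrices.

    Teacher and student embeddings may have different widths, because only
    the N x N token-to-token relations are compared.
    """
    rel_t = cosine_similarity_matrix(teacher_tokens)
    rel_s = cosine_similarity_matrix(student_tokens)
    n = len(rel_t)
    return sum(
        (rel_t[i][j] - rel_s[i][j]) ** 2 for i in range(n) for j in range(n)
    ) / (n * n)
```

One reason to compare similarity matrices rather than raw embeddings is that a small student usually has a narrower hidden size than its teacher, so the embeddings cannot be matched element by element without an extra projection, while the pairwise relations always have the same shape.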

Code & Models

Repositories

Fantasyele/LLaVA-KD (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: ALIGN