LLAVADI: What Matters For Multimodal Large Language Models Distillation

Shilin Xu; Xiangtai Li; Haobo Yuan; Lu Qi; Yunhai Tong; Ming-Hsuan Yang

arXiv:2407.19409·cs.CL·July 30, 2024

TL;DR

This paper investigates effective knowledge distillation strategies for training small-scale Multimodal Large Language Models (MLLMs), demonstrating that proper alignment techniques enable smaller models to match larger ones' performance.

Contribution

It provides the first comprehensive study on multimodal distillation, highlighting key training strategies and alignment methods that improve small MLLMs' performance.

Findings

01

Joint alignment of tokens and logits is crucial for effective distillation.

02

A 2.7B model can match larger models' performance with proper strategies.

03

The study offers practical guidelines for training small-scale MLLMs.

Abstract

The recent surge in Multimodal Large Language Models (MLLMs) has showcased their remarkable potential for achieving generalized intelligence by integrating visual understanding into Large Language Models. Nevertheless, the sheer model size of MLLMs leads to substantial memory and computational demands that hinder their widespread deployment. In this work, we do not propose a new efficient model structure or train small-scale MLLMs from scratch. Instead, we focus on what matters for training small-scale MLLMs through knowledge distillation, which is the first step from the multimodal distillation perspective. Our extensive studies involve training strategies, model choices, and distillation algorithms in the knowledge distillation process. These results show that joint alignment of both tokens and logits plays a critical role in teacher-student frameworks. In addition, we draw a…
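To make "logit alignment" in a teacher-student framework concrete, here is a minimal sketch assuming the common formulation in which the student is trained to match the teacher's per-token output distribution through a KL divergence. The abstract does not state the paper's exact loss, so the function names, the temperature parameter, and the choice of KL(teacher || student) are illustrative assumptions rather than the authors' implementation.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_sum_exp for x in scaled]


def logit_distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """Mean KL(teacher || student) over token positions.

    Hypothetical helper: each argument is a list of per-token logit vectors,
    and teacher and student are assumed to share the same vocabulary.
    """
    if len(teacher_logits) != len(student_logits):
        raise ValueError("teacher and student must cover the same token positions")
    total = 0.0
    for teacher_row, student_row in zip(teacher_logits, student_logits):
        log_p = log_softmax(teacher_row, temperature)  # teacher distribution
        log_q = log_softmax(student_row, temperature)  # student distribution
        total += sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return total / len(teacher_logits)


# Toy example: two token positions, vocabulary of three entries.
teacher = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
student = [[1.5, 0.7, -0.5], [0.0, 0.0, 2.0]]
loss = logit_distillation_loss(teacher, student, temperature=2.0)
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge. Token alignment, the other ingredient the abstract names, is not covered by this sketch.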

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Focus · Knowledge Distillation