MLAN: Language-Based Instruction Tuning Preserves and Transfers Knowledge in Multimodal Language Models

Jianhong Tu; Zhuohao Ni; Nicholas Crispino; Zihao Yu; Michael Bendersky; Beliz Gunel; Ruoxi Jia; Xin Liu; Lingjuan Lyu; Dawn Song; Chenguang Wang

arXiv:2411.10557 · cs.CL · July 1, 2025

TL;DR

This paper introduces a visual instruction tuning method that leverages diverse text-only data to enhance zero-shot generalization in multimodal language models, reducing reliance on vision-language data.

Contribution

The paper demonstrates that incorporating extensive text-only data in visual instruction tuning can match the performance of traditional vision-heavy mixtures while using as few as half the total training tokens.

Findings

01 Text-heavy instruction tuning performs on par with vision-heavy mixtures on both modalities across 12 general datasets.

02 Diverse text-only data enables knowledge transfer across modalities.

03 The approach is more efficient, using as few as half the total training tokens.

Abstract

We present a novel visual instruction tuning strategy to improve the zero-shot task generalization of multimodal large language models by building a firm text-only knowledge base. Existing work lacks sufficient experimentation on the importance of each modality in the instruction tuning stage, often using a majority of vision-language data while keeping text-only data limited and fixing mixtures of modalities. By incorporating diverse text-only data in the visual instruction tuning stage, we vary vision-language data in various controlled experiments to investigate the importance of modality in visual instruction tuning. Our comprehensive evaluation shows that the text-heavy instruction tuning approach is able to perform on-par with traditional vision-heavy mixtures on both modalities across 12 general datasets while using as low as half the total training tokens. We find that simply…
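As a rough illustration of what a text-heavy instruction tuning mixture could look like, the sketch below samples a fixed number of examples with a chosen share of text-only data. The function name `build_mixture`, the `text_fraction` parameter, and the example layout are hypothetical and are not taken from the paper or its repository; the paper's actual ratios and sampling procedure are not stated in this summary.

```python
import random


def build_mixture(text_examples, vision_examples, text_fraction, total, seed=0):
    """Sample `total` examples, about `text_fraction` of them text-only.

    Hypothetical helper for illustration only; not the authors' code.
    """
    if not 0.0 <= text_fraction <= 1.0:
        raise ValueError("text_fraction must be in [0, 1]")

    rng = random.Random(seed)  # fixed seed so the mixture is reproducible

    # Split the budget between the two modalities.
    n_text = round(total * text_fraction)
    n_vision = total - n_text
    if n_text > len(text_examples) or n_vision > len(vision_examples):
        raise ValueError("not enough examples to fill the requested mixture")

    # Sample without replacement from each pool, then interleave by shuffling.
    mixture = rng.sample(text_examples, n_text) + rng.sample(vision_examples, n_vision)
    rng.shuffle(mixture)
    return mixture
```

Varying `text_fraction` while holding `total` fixed is one simple way to set up the kind of controlled modality comparison the abstract describes, assuming each example carries a tag identifying its modality.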

Code & Models

Repositories

wang-research-lab/mlan (PyTorch, official)

Taxonomy

Topics: Speech and dialogue systems · Natural Language Processing Techniques

Methods: LLaMA