CoMMIT: Coordinated Multimodal Instruction Tuning

Xintong Li; Junda Wu; Tong Yu; Yu Wang; Xiang Chen; Jiuxiang Gu; Lina Yao; Julian McAuley; Jingbo Shang

arXiv:2407.20454 · cs.LG · September 10, 2025

TL;DR

This paper introduces CoMMIT, a method for improving multimodal instruction tuning by balancing learning between language models and feature encoders, leading to better convergence and task performance.

Contribution

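The paper proposes a Multimodal Balance Coefficient, which quantifies how balanced learning is between the feature encoder and the LLM, and a dynamic learning scheduler that coordinates the two modules, addressing the oscillation and biased-learning problems in multimodal instruction tuning.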

Findings

01 Enhanced convergence stability in MLLMs
02 Improved downstream task performance
03 Effective across various architectures

Abstract

Instruction tuning in multimodal large language models (MLLMs) generally involves cooperative learning between a backbone LLM and a feature encoder for non-text input modalities. The major challenge is how to efficiently find the synergy between the two modules, so that the LLM can adapt its reasoning abilities to downstream tasks while the feature encoder can adjust to provide more task-specific information about its modality. In this paper, we analyze MLLM instruction tuning from both theoretical and empirical perspectives, and find that unbalanced learning between the feature encoder and the LLM can cause oscillation and biased learning, which lead to sub-optimal convergence. Inspired by these findings, we propose a Multimodal Balance Coefficient that enables quantitative measurement of the balance of learning. Based on this, we further design a dynamic learning scheduler…
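The idea of measuring learning balance and adjusting the two modules' learning rates can be illustrated with a small sketch. The abstract as shown here does not give the definition of the Multimodal Balance Coefficient or the scheduler's update rule, so everything below is an assumption made for illustration: the coefficient is approximated by the ratio of parameter-update norms of the two modules, and the function names (`update_norm`, `balance_coefficient`, `rebalance_lrs`) and the `strength` parameter are hypothetical, not the paper's method.

```python
import math


def update_norm(old, new):
    """L2 norm of the change between two flat parameter lists."""
    return math.sqrt(sum((n - o) ** 2 for o, n in zip(old, new)))


def balance_coefficient(encoder_step, llm_step, eps=1e-12):
    """Hypothetical proxy (NOT the paper's definition): ratio of the
    feature encoder's update size to the LLM's update size.
    Values far from 1 suggest one module is learning much faster."""
    return encoder_step / (llm_step + eps)


def rebalance_lrs(lr_encoder, lr_llm, kappa, strength=0.5, eps=1e-12):
    """Hypothetical scheduler step: scale the two learning rates in
    opposite directions so the update-size ratio drifts toward 1."""
    scale = max(kappa, eps) ** strength
    return lr_encoder / scale, lr_llm * scale


# Example: the encoder moved 4x as far as the LLM in the last step.
kappa = balance_coefficient(encoder_step=4.0, llm_step=1.0)
new_lr_encoder, new_lr_llm = rebalance_lrs(1e-3, 1e-5, kappa)
```

In this toy setting, a coefficient near 4 with `strength=0.5` halves the encoder's learning rate and doubles the LLM's, which is one simple way to damp the faster module. The paper's actual coefficient and scheduler may differ substantially.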

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems