Rethinking the Text-Vision Reasoning Imbalance in MLLMs through the Lens of Training Recipes
Guanyu Yao, Qiucheng Wu, Yang Zhang, Zhaowen Wang, Handong Zhao, Shiyu Chang

TL;DR
This paper investigates the imbalance in multimodal large language models' reasoning abilities across visual and textual modalities, analyzing how training strategies influence this gap and proposing methods to achieve more balanced multimodal reasoning.
Contribution
The paper shows that existing training recipes tend to amplify the modality gap, and it explores data and loss-design strategies to mitigate this imbalance in MLLMs.
Findings
Existing training recipes tend to increase the modality gap.
Data and loss-design strategies can reduce the performance disparity between text-centric and vision-centric inputs.
Balanced training approaches improve multimodal reasoning capabilities.
Abstract
Multimodal large language models (MLLMs) have demonstrated strong capabilities on vision-and-language tasks. However, recent findings reveal an imbalance in their reasoning capabilities across visual and textual modalities. Specifically, current MLLMs often over-rely on textual cues while under-attending to visual content, resulting in suboptimal performance on tasks that require genuine visual reasoning. We refer to this phenomenon as the "modality gap", defined as the performance disparity between text-centric and vision-centric inputs. In this paper, we analyze the modality gap through the lens of training recipes. We first show that existing training recipes tend to amplify this gap. Then, we systematically explore strategies to bridge it from two complementary perspectives: data and loss design. Our findings provide insights into developing training recipes that mitigate the…
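Illustration
The abstract defines the modality gap only as a performance disparity between text-centric and vision-centric inputs. The sketch below shows one way such a number could be computed. It assumes accuracy as the performance metric and a signed difference (text minus vision); neither choice is confirmed by the abstract, and the function names and sample data are hypothetical.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the reference answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def modality_gap(text_accuracy: float, vision_accuracy: float) -> float:
    """Assumed definition: text-centric accuracy minus vision-centric accuracy.

    A positive value means the model does better when the problem is
    presented as text than when the same problem is presented visually.
    """
    return text_accuracy - vision_accuracy


# Hypothetical example: the same four questions, posed once as text
# and once as images, with made-up model outputs.
answers = ["A", "C", "B", "D"]
text_predictions = ["A", "C", "B", "A"]    # 3 of 4 correct
vision_predictions = ["A", "B", "B", "A"]  # 2 of 4 correct

gap = modality_gap(
    accuracy(text_predictions, answers),
    accuracy(vision_predictions, answers),
)
print(f"modality gap: {gap:.2f}")
```

Scoring both presentations against the same questions keeps the comparison about modality rather than question difficulty, which is why the sketch reuses a single answer list.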
