Some Modalities are More Equal Than Others: Decoding and Architecting Multimodal Integration in MLLMs

Tianle Chen; Chaitanya Chakka; Arjun Reddy Akula; Xavier Thomas; Deepti Ghadiyaram

arXiv:2511.22826 · cs.CV · December 4, 2025

TL;DR

This paper evaluates how robust Multimodal Large Language Models (MLLMs) are to conflicting modalities. It finds that they struggle with misaligned audio-visual pairs and simple misleading text, and it proposes an alignment tuning method to improve multimodal reasoning and reliability.

Contribution

The paper introduces MMA-Bench for testing modality reliance, analyzes MLLMs' brittleness, and proposes a modality alignment tuning strategy to enhance cross-modal reasoning.

Findings

01 MLLMs are sensitive to misaligned audio-visual inputs.

02 Current MLLMs struggle with simple misleading text.

03 Alignment tuning improves multimodal grounding.

Abstract

Despite remarkable advancements in Multimodal Large Language Models (MLLMs), a fundamental question remains: are MLLMs robust to contradicting modalities? To rigorously study this, we introduce MMA-Bench comprising videos and tasks that probe a model's reliance on specific modalities. Using black-box and white-box interpretability techniques, we provide a critical analysis of the brittleness of both open- and closed-sourced MLLMs. We show that current MLLMs struggle under misaligned audio-visual pairs and simple misleading text, thereby lacking robust multi-modal reasoning. Building on these findings, we propose a modality alignment tuning strategy to teach the model when to prioritize, leverage, or ignore specific modality cues. Through extensive experiments and analysis, we show that our alignment tuning yields demonstrably stronger multimodal grounding. This work provides both…
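
To make the idea of probing modality reliance under misaligned audio-visual pairs concrete, here is a minimal sketch. It is not the MMA-Bench construction or the authors' code: the clip dictionary layout, the function names `make_misaligned_pairs` and `modality_reliance`, and the three-way scoring (video, audio, neither) are all assumptions made for illustration.

```python
import random
from collections import Counter


def make_misaligned_pairs(clips, seed=0):
    """Pair each clip's video with audio taken from a clip of a different class.

    `clips` is a hypothetical list of dicts with "video", "audio", and "label" keys.
    """
    rng = random.Random(seed)
    pairs = []
    for clip in clips:
        # Only clips with a different label can supply conflicting audio.
        donors = [c for c in clips if c["label"] != clip["label"]]
        if not donors:
            raise ValueError("need at least two distinct labels")
        donor = rng.choice(donors)
        pairs.append({
            "video": clip["video"],
            "audio": donor["audio"],
            "video_label": clip["label"],
            "audio_label": donor["label"],
        })
    return pairs


def modality_reliance(pairs, predictions):
    """Return the fraction of predictions that follow the video, the audio, or neither."""
    if not pairs or len(pairs) != len(predictions):
        raise ValueError("pairs and predictions must be non-empty and equal in length")
    counts = Counter()
    for pair, pred in zip(pairs, predictions):
        if pred == pair["video_label"]:
            counts["video"] += 1
        elif pred == pair["audio_label"]:
            counts["audio"] += 1
        else:
            counts["neither"] += 1
    return {key: counts[key] / len(pairs) for key in ("video", "audio", "neither")}


# Illustrative data only; file names and labels are made up.
clips = [
    {"video": "dog.mp4", "audio": "dog.wav", "label": "dog"},
    {"video": "piano.mp4", "audio": "piano.wav", "label": "piano"},
    {"video": "car.mp4", "audio": "car.wav", "label": "car"},
]
pairs = make_misaligned_pairs(clips)
# Stand-in for model outputs: a model that always answers from the video stream.
predictions = [pair["video_label"] for pair in pairs]
reliance = modality_reliance(pairs, predictions)
```

In a real evaluation, `predictions` would come from the model under test, and a strong skew toward one modality on a question that targets the other would suggest the kind of brittleness the abstract describes.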

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Topic Modeling