When Modalities Conflict: How Unimodal Reasoning Uncertainty Governs Preference Dynamics in MLLMs
Zhuoran Zhang, Tengyue Wang, Xilin Gong, Yang Shi, Haotian Wang, Di Wang, Lijie Hu

TL;DR
This paper introduces a framework to understand how multimodal large language models decide which modality to follow when faced with conflicting information, emphasizing the roles of uncertainty and inherent bias.
Contribution
It proposes a novel decomposition of modality following into relative reasoning uncertainty and inherent modality preference, validated by a controllable dataset and entropy-based metrics.
Findings
Follow probability decreases with increasing relative uncertainty.
The balance point indicates the model's inherent modality bias.
Models oscillate between modalities near the balance point, revealing internal decision mechanisms.
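As a rough illustration of the balance point named in the findings, the sketch below estimates it from a binned follow-probability curve. It is not the authors' procedure: the function name `balance_point`, the 0.5 crossing level, the linear interpolation, and the sample numbers are all assumptions made for illustration.

```python
def balance_point(rel_uncertainty, follow_prob, level=0.5):
    """Estimate where a follow-probability curve crosses `level`.

    rel_uncertainty: relative uncertainty of one modality, one value per bin.
    follow_prob: observed probability of following that modality in each bin.
    Returns the interpolated crossing point, or None if the curve never crosses.
    (Hypothetical helper; the crossing level and interpolation are assumptions.)
    """
    pairs = sorted(zip(rel_uncertainty, follow_prob))
    for (x0, y0), (x1, y1) in zip(pairs, pairs[1:]):
        # A crossing lies in this segment if the two ends straddle `level`.
        if (y0 - level) * (y1 - level) <= 0 and y0 != y1:
            return x0 + (level - y0) * (x1 - x0) / (y1 - y0)
    return None


# Made-up curve: follow probability falls as relative uncertainty rises.
xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
ys = [0.95, 0.80, 0.60, 0.30, 0.10]
bp = balance_point(xs, ys)
```

Under these assumptions, a crossing away from zero would mean the model still follows the modality about half the time even though its uncertainty is not balanced against the other modality's, which is one way to read an inherent preference.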
Abstract
Multimodal large language models (MLLMs) must resolve conflicts when different modalities provide contradictory information, a process we term modality following. Prior work measured this behavior only with coarse dataset-level statistics, overlooking the influence of a model's confidence in unimodal reasoning. In this paper, we introduce a new framework that decomposes modality following into two fundamental factors: relative reasoning uncertainty (the case-specific confidence gap between unimodal predictions) and inherent modality preference (a model's stable bias when uncertainties are balanced). To validate this framework, we construct a controllable dataset that systematically varies the reasoning difficulty of visual and textual inputs. Using entropy as a fine-grained uncertainty metric, we uncover a universal law: the probability of following a modality decreases monotonically as…
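To make the entropy-based uncertainty idea concrete, here is a minimal sketch. The abstract excerpt does not give the exact formula for relative reasoning uncertainty, so the normalized entropy gap below, along with the names `entropy` and `relative_uncertainty` and the example distributions, is an assumption rather than the paper's definition.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete answer distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def relative_uncertainty(text_probs, vision_probs, eps=1e-12):
    """Hypothetical normalized entropy gap in [-1, 1].

    Positive values mean the text-only prediction is less certain than the
    vision-only prediction; zero means the two uncertainties are balanced.
    (Assumed form; the paper's own definition may differ.)
    """
    h_text = entropy(text_probs)
    h_vision = entropy(vision_probs)
    return (h_text - h_vision) / (h_text + h_vision + eps)


# Made-up unimodal answer distributions over two options.
uncertain_text = [0.5, 0.5]    # text-only reasoning is maximally unsure
confident_vision = [0.9, 0.1]  # vision-only reasoning is fairly sure
gap = relative_uncertainty(uncertain_text, confident_vision)
```

Normalizing by the sum of the two entropies keeps the gap on a fixed scale regardless of how many answer options there are, which is a design choice of this sketch only.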
Peer Reviews
Decision·Submitted to ICLR 2026
1. Novel conceptual framework – Clearly decomposes modality-following behavior into relative uncertainty and inherent preference, offering a principled explanation beyond prior dataset-level analyses.
2. Mechanistic insight – The layer-wise oscillation analysis provides an interpretable link between internal model dynamics and external behavioral indecision.
3. Practical interpretability – The notion of a “balance point” gives a simple quantitative metric for comparing inherent modality preferen
1. Limited scenario diversity – Both experimental settings focus on geometric shape perception tasks with relatively simple, synthetic visual scenes. This narrow scope limits how far the findings generalize to more complex, real-world multimodal reasoning scenarios.
2. Lack of downstream validation – The paper does not demonstrate whether understanding or controlling modality preference improves practical tasks (e.g., VQA or captioning).
3. Interpretability gap – While oscillation is s
- The finding that the probability of following a modality decreases monotonically as its relative uncertainty increases is interesting!
- The paper provides a clear and interpretable decomposition of multimodal decision-making, distinguishing between case-specific uncertainty and model-level bias.
- The paper’s analysis is systematic, the hypotheses are well-motivated, and the results are supported by clear empirical trends and visualizations
- Some prior work on semantic bias and language bias might be worth discussing. For instance, [1] argue that linguistic priors learned during pre-training can “hack” or dominate visual inference, which appears conceptually related to the present work’s notion of modality preference.
- Model selection for Figure 4(a) and 4(b). Different model sets are used in these two subfigures, but it is unclear whether the remaining models exhibit behavioral patterns similar to those of the three presented ones.
- Model i
1. The paper introduces a novel and principled analytical framework that deconstructs the complex phenomenon of modality following into two more fundamental components: relative reasoning uncertainty and inherent modality preference. This approach moves beyond the prior work to offer a more powerful explanatory model, representing a conceptual contribution to understanding multimodal conflict resolution.
2. The work is supported by a relatively rigorous experimental design, centered on a custom-
1. The core findings are predominantly derived from a synthetic "toy dataset" focused on color and attribute recognition of simple geometric shapes. While this controlled setting is ideal for validating fundamental principles, it raises questions about the generalizability of the conclusions to more complex, subtle, and semantic conflicts found in real-world scenarios. It is not yet clear if the framework, especially the stability of a model's "balance point," holds across a wider variety of rea
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Speech and dialogue systems
