When Modalities Conflict: How Unimodal Reasoning Uncertainty Governs Preference Dynamics in MLLMs

Zhuoran Zhang; Tengyue Wang; Xilin Gong; Yang Shi; Haotian Wang; Di Wang; Lijie Hu

arXiv:2511.02243·cs.AI·November 5, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces a framework to understand how multimodal large language models decide which modality to follow when faced with conflicting information, emphasizing the roles of uncertainty and inherent bias.

Contribution

It proposes a novel decomposition of modality following into relative reasoning uncertainty and inherent modality preference, validated by a controllable dataset and entropy-based metrics.

Findings

01

Follow probability decreases with increasing relative uncertainty.

02

The balance point indicates the model's inherent modality bias.

03

Models oscillate between modalities near the balance point, revealing internal decision mechanisms.
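To make the notion of a balance point concrete, here is a minimal sketch of one way to read it off a follow-probability curve. It assumes the curve is given as sorted (relative uncertainty, follow probability) pairs and that the balance point is where the curve crosses 0.5; the function name `balance_point`, the linear interpolation, and the toy values are illustrative assumptions, not the paper's procedure or results.

```python
def balance_point(points, level=0.5):
    """Return the x value where a piecewise-linear curve crosses `level`.

    `points` is a list of (relative_uncertainty, follow_probability) pairs
    sorted by relative_uncertainty. Returns None if there is no crossing.
    This is a hypothetical helper, not the authors' estimator.
    """
    for (x0, p0), (x1, p1) in zip(points, points[1:]):
        if p0 == level:
            return x0
        # A sign change between neighbours means the curve crosses `level` here.
        if (p0 - level) * (p1 - level) < 0:
            return x0 + (level - p0) * (x1 - x0) / (p1 - p0)
    if points and points[-1][1] == level:
        return points[-1][0]
    return None


# Toy values made up for illustration only (not data from the paper):
# probability of following text as text becomes relatively more uncertain.
toy_curve = [(-1.0, 0.95), (-0.5, 0.85), (0.0, 0.70), (0.5, 0.40), (1.0, 0.10)]
crossing = balance_point(toy_curve)
print(f"estimated balance point: {crossing:.3f}")
```

Under these assumptions, a crossing away from zero would suggest that the model keeps following one modality even when that modality is somewhat more uncertain, which is one way to read the inherent bias described in finding 02.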

Abstract

Multimodal large language models (MLLMs) must resolve conflicts when different modalities provide contradictory information, a process we term modality following. Prior work measured this behavior only with coarse dataset-level statistics, overlooking the influence of a model's confidence in unimodal reasoning. In this paper, we introduce a new framework that decomposes modality following into two fundamental factors: relative reasoning uncertainty (the case-specific confidence gap between unimodal predictions) and inherent modality preference (a model's stable bias when uncertainties are balanced). To validate this framework, we construct a controllable dataset that systematically varies the reasoning difficulty of visual and textual inputs. Using entropy as a fine-grained uncertainty metric, we uncover a universal law: the probability of following a modality decreases monotonically as…
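As a rough illustration of entropy as an uncertainty metric, the sketch below computes the Shannon entropy of two unimodal answer distributions and a normalized gap between them. The abstract (truncated here) does not give the exact formula for relative reasoning uncertainty, so `relative_uncertainty` and its normalization are assumptions made for illustration, as are the example probabilities.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def relative_uncertainty(text_probs, vision_probs):
    """Hypothetical normalized entropy gap in [-1, 1].

    Positive values mean the text-only prediction is more uncertain than
    the vision-only prediction. This normalization is an assumption, not
    necessarily the definition used in the paper.
    """
    h_text = entropy(text_probs)
    h_vision = entropy(vision_probs)
    total = h_text + h_vision
    if total == 0.0:
        return 0.0  # both predictions are fully certain
    return (h_text - h_vision) / total


# Made-up answer distributions over four options (illustrative only).
confident_vision = [0.97, 0.01, 0.01, 0.01]
uncertain_text = [0.40, 0.30, 0.20, 0.10]
gap = relative_uncertainty(uncertain_text, confident_vision)
print(f"relative uncertainty (text vs. vision): {gap:.3f}")
```

With a measure of this kind, the law stated in the abstract would predict that the model is less likely to follow text as the gap grows.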

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 2

Strengths

1. Novel conceptual framework – Clearly decomposes modality-following behavior into relative uncertainty and inherent preference, offering a principled explanation beyond prior dataset-level analyses.
2. Mechanistic insight – The layer-wise oscillation analysis provides an interpretable link between internal model dynamics and external behavioral indecision.
3. Practical interpretability – The notion of a “balance point” gives a simple quantitative metric for comparing inherent modality preferen

Weaknesses

1. Limited scenario diversity – Both experimental settings focus on geometric shape perception tasks with relatively simple and synthetic visual scenes. This narrow scope limits the generalizability of the findings to more complex, real-world multimodal reasoning scenarios.
2. Lack of downstream validation – The paper does not demonstrate whether understanding or controlling modality preference improves practical tasks (e.g., VQA or captioning).
3. Interpretability gap – While oscillation is s

Reviewer 02 · Rating 6 · Confidence 5

Strengths

- The finding that the probability of following a modality decreases monotonically as its relative uncertainty increases is interesting!
- The paper provides a clear and interpretable decomposition of multimodal decision-making, distinguishing between case-specific uncertainty and model-level bias.
- The paper’s analysis is systematic, the hypotheses are well-motivated, and the results are supported by clear empirical trends and visualizations

Weaknesses

- Some prior work on semantic bias and language bias may be worth discussing. For instance, [1] argue that linguistic priors learned during pre-training can “hack” or dominate visual inference, which appears conceptually related to the present work’s notion of modality preference.
- Model selection for Figure 4(a) and 4(b). Different model sets are used in these two subfigures, and it is unclear whether the remaining models exhibit behavioral patterns similar to those of the three presented.
- Model i

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. The paper introduces a novel and principled analytical framework that deconstructs the complex phenomenon of modality following into two more fundamental components: relative reasoning uncertainty and inherent modality preference. This approach moves beyond the prior work to offer a more powerful explanatory model, representing a conceptual contribution to understanding multimodal conflict resolution.
2. The work is supported by a relatively rigorous experimental design, centered on a custom-

Weaknesses

1. The core findings are predominantly derived from a synthetic "toy dataset" focused on color and attribute recognition of simple geometric shapes. While this controlled setting is ideal for validating fundamental principles, it raises questions about the generalizability of the conclusions to more complex, subtle, and semantic conflicts found in real-world scenarios. It is not yet clear if the framework, especially the stability of a model's "balance point," holds across a wider variety of rea

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Speech and dialogue systems