Analyzing Reasoning Consistency in Large Multimodal Models under Cross-Modal Conflicts

Zhihao Zhu; Jiafeng Liang; Shixin Jiang; Jinlan Fu; Ming Liu; Guanglu Sun; See-Kiong Ng; Bing Qin

arXiv:2601.04073 · cs.CV · January 8, 2026

Open Access

TL;DR

This paper investigates how robust the reasoning of large multimodal models is when their own text conflicts with visual evidence. It shows that the models tend to propagate textual hallucinations, and it proposes a visual grounding method to improve their accuracy and reliability.

Contribution

It introduces the LogicGraph Perturbation Protocol for evaluating reasoning robustness and proposes Active Visual-Context Refinement to mitigate hallucinations without retraining.

Findings

01

Models successfully self-correct in fewer than 10% of cases

02

Textual hallucinations tend to propagate blindly in reasoning chains

03

The proposed method, Active Visual-Context Refinement, significantly reduces hallucination propagation and improves robustness

Abstract

Large Multimodal Models (LMMs) have demonstrated impressive capabilities in video reasoning via Chain-of-Thought (CoT). However, the robustness of their reasoning chains remains questionable. In this paper, we identify a critical failure mode termed textual inertia, where once a textual hallucination occurs in the thinking process, models tend to blindly adhere to the erroneous text while neglecting conflicting visual evidence. To systematically investigate this, we propose the LogicGraph Perturbation Protocol that structurally injects perturbations into the reasoning chains of diverse LMMs spanning both native reasoning architectures and prompt-driven paradigms to evaluate their self-reflection capabilities. The results reveal that models successfully self-correct in less than 10% of cases and predominantly succumb to blind textual error propagation. To mitigate this, we introduce…
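The evaluation idea in the abstract (inject an error into a reasoning chain, then check whether the model corrects it or carries it forward) can be illustrated with a minimal Python sketch. This is not the authors' LogicGraph Perturbation Protocol: the function names, the outcome labels, and the choice to truncate the chain right after the injected step are all assumptions made for illustration. The outcome labels are taken as given here, since judging them would require a model and a video.

```python
from collections import Counter


def inject_perturbation(chain, index, false_step):
    """Return the reasoning prefix a model would be asked to continue.

    Hypothetical helper: the step at `index` is replaced by `false_step`
    and everything after it is dropped (an assumption, not the paper's
    stated procedure).
    """
    if not 0 <= index < len(chain):
        raise IndexError("perturbation index out of range")
    return chain[:index] + [false_step]


def outcome_rates(labels):
    """Fraction of trials per outcome label (labels are hypothetical)."""
    if not labels:
        return {}
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


# Toy chain and toy labels, invented for this example only.
chain = [
    "The person picks up a red cup.",
    "They pour water into it.",
    "They drink from the cup.",
]
prefix = inject_perturbation(chain, 1, "They pour sand into it.")

# Made-up outcomes for four trials; these are not results from the paper.
labels = ["propagated", "propagated", "propagated", "self_corrected"]
rates = outcome_rates(labels)
```

Here `prefix` holds the first original step followed by the injected false step, and `rates` maps each label to its share of trials. A real evaluation would still need a way to decide which label each model continuation deserves.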

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Explainable Artificial Intelligence (XAI)