Is Cognition Consistent with Perception? Assessing and Mitigating Multimodal Knowledge Conflicts in Document Understanding
Zirui Shao, Feiyu Gao, Zhaoqing Zhu, Chuwei Luo, Hangdi Xing, Zhi Yu, Qi Zheng, Ming Yan, Jiajun Bu

TL;DR
This paper investigates conflicts between perception and cognition in multimodal large language models during document understanding, revealing significant inconsistencies and proposing a fine-tuning method to improve their knowledge alignment and performance.
Contribution
It defines and systematically assesses cognition-perception conflicts in MLLMs and introduces a novel fine-tuning approach to mitigate these conflicts and enhance model performance.
Findings
GPT-4o achieves only 75.26% C&P consistency.
The proposed fine-tuning reduces knowledge conflicts across models.
The fine-tuning also improves performance on both cognitive and perceptual tasks.
Abstract
Multimodal large language models (MLLMs) have shown impressive capabilities in document understanding, a rapidly growing research area with significant industrial demand. As a multimodal task, document understanding requires models to possess both perceptual and cognitive abilities. However, due to different types of annotation noise in training, current MLLMs often face conflicts between perception and cognition. Taking a document VQA task (cognition) as an example, an MLLM might generate answers that do not match the corresponding visual content identified by its OCR (perception). This conflict suggests that the MLLM might struggle to establish an intrinsic connection between the information it "sees" and what it "understands". Such conflicts challenge the intuitive notion that cognition is consistent with perception, hindering the performance and explainability of MLLMs. In this…
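To make the idea of a cognition-perception conflict concrete, the sketch below scores agreement between a model's VQA answer (cognition) and the text its OCR reads from the same region (perception). It is an illustration only, not the paper's metric: the exact-match rule after normalization, the helper names `normalize` and `cp_consistency`, and the sample pairs are all assumptions made for this example.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return re.sub(r"\s+", " ", text.strip().lower())


def cp_consistency(pairs):
    """Fraction of (cognitive_answer, perceptual_text) pairs that agree.

    Assumption: a pair is consistent when the two strings match exactly
    after normalization. The paper may use a different matching rule.
    """
    if not pairs:
        raise ValueError("need at least one (answer, ocr_text) pair")
    agree = sum(normalize(c) == normalize(p) for c, p in pairs)
    return agree / len(pairs)


# Hypothetical outputs for four document regions: (VQA answer, OCR text).
pairs = [
    ("Total: $42.00", "total:  $42.00"),  # agree after normalization
    ("2021-03-15", "2021-03-16"),         # conflict: the dates differ
    ("ACME Corp", "ACME Corp"),           # agree
    ("Invoice 118", "Invoice 113"),       # conflict: the numbers differ
]

rate = cp_consistency(pairs)
print(f"C&P consistency: {rate:.2%}")
```

Exact matching is the strictest reasonable choice. A softer rule, such as an edit-distance threshold, would count near-misses as consistent and raise the score.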
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
1. This work highlights a noteworthy issue: an (M)LLM's factual understanding of identical content can vary depending on contextual differences.
2. The joint OCR-VQA dataset proposed in this paper may benefit future research in related topics.
1. Insufficient Experimental Results on C&P Conflict Causes and Impact: The paper does not provide extensive experimental results on the root causes of C&P conflicts or a detailed analysis of their impact on downstream tasks. While it demonstrates the existence of these conflicts and the effectiveness of the proposed method in reducing them, a deeper investigation into why these conflicts occur and how they specifically affect performance in various tasks could strengthen the paper's contributio
1. **Clear Experimental Evidence:** The study provides a clear statistical presentation of answer inconsistencies in MLLM responses when different question formats are used, effectively highlighting the inconsistency issue in MLLMs.
2. **Definition of C&P Knowledge Conflicts:** The paper attempts to systematically define these inconsistencies as *Cognition and Perception (C&P) knowledge conflicts*, offering a detailed definition of the term.
3. **Systematic Evaluation of Existing Models:** It
1. **Lack of Robust Explanation for C&P Conflicts:** The attribution of answer inconsistencies across different questioning formats solely to Cognition and Perception (C&P) knowledge conflicts lacks a thorough explanation. The authors could strengthen this claim by including experiments that control for potential randomness in MLLM responses. For example, they could ask the same question multiple times with identical phrasing to observe whether inconsistencies persist under uniform questioning f
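The control this reviewer describes, repeating one question with identical phrasing, could be scored along the following lines. This is a minimal sketch under stated assumptions: the answers are hard-coded hypothetical strings rather than real model outputs, and `self_agreement` and `mean_self_agreement` are illustrative helper names, not anything from the paper.

```python
from collections import Counter


def self_agreement(answers):
    """Share of repeated answers that match the most frequent answer."""
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(a.strip().lower() for a in answers)
    return counts.most_common(1)[0][1] / len(answers)


def mean_self_agreement(runs):
    """Average self-agreement over all questions in `runs`."""
    scores = [self_agreement(answers) for answers in runs.values()]
    return sum(scores) / len(scores)


# Hypothetical answers from asking each question four times, unchanged.
runs = {
    "q1": ["42", "42", "42", "42"],
    "q2": ["March 15", "March 16", "march 15", "March 15"],
}

baseline = mean_self_agreement(runs)
print(f"mean self-agreement: {baseline:.3f}")
```

If agreement across question formats is clearly lower than this same-prompt baseline, sampling randomness alone is unlikely to explain the gap.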
There are several noticeable strengths in this paper:
* This paper turns to a new research area and presents a well-justified motivation for examining the conflicts between the cognition and perception knowledge of MLLMs.
* This submission conducts extensive experiments to examine the performance of current open-source and closed-source MLLMs on cognition and perception knowledge conflicts, and to verify the effectiveness of the proposed fine-tuning method.
* The manuscript is commendable for
My main criticism of this submission concerns its experimental design:
* **Limited Reproducibility**: Although the submission provides links to the model and weights, details about the hyperparameters used during inference are missing (e.g., values for top_p, top_k, temperature, beam search, etc.). This lack of transparency makes it difficult to ensure fair comparisons across models and affects reproducibility.
* **Lack of Repeated Experiments**: The authors did not report the mean and standard de
- This paper introduces a new concept, Cognition and Perception Knowledge Conflicts, to investigate the behavioral discrepancy between the cognitive and perceptual abilities of MLLMs.
- This paper proposes a tuning method to mitigate the conflict by constructing a new dataset for training.
- The concept of Cognition and Perception knowledge conflicts is not entirely convincing:
  - The task used to measure the model's cognitive abilities does not seem to require much cognitive processing.
  - The tasks used in these experiments are limited in scope.
- It is unclear whether the low C&P consistencies arise from C&P knowledge conflicts or simply from poor model performance:
  - The varying consistency scores across different datasets in Table 2 suggest that the conflict may not be a
Taxonomy
Topics: Text Readability and Simplification · Educational Strategies and Epistemologies · Language, Metaphor, and Cognition
