VFaith: Do Large Multimodal Models Really Reason on Seen Images Rather than Previous Memories?
Jiachen Yu, Yufei Zhan, Ziheng Wu, Yousong Zhu, Jinqiao Wang, Minghui Qiu

TL;DR
This paper introduces VFaith-Bench, a benchmark and pipeline to evaluate and analyze the visual reasoning faithfulness of multimodal large language models by editing visual cues and measuring performance changes.
Contribution
It presents a novel automatic editing pipeline and a comprehensive benchmark to assess the visual reasoning capabilities and faithfulness of MLLMs.
Findings
Models' reasoning ability is closely linked to visual perception.
Editing visual cues significantly impacts model accuracy.
Analysis reveals the source of reasoning capabilities in MLLMs.
Abstract
Recent extensive works have demonstrated that introducing long CoT can effectively enhance the capabilities of MLLMs to solve complex problems. However, the reasons for the effectiveness of such paradigms remain unclear. It is challenging to analyze quantitatively how much the model's extraction of visual cues and its subsequent so-called reasoning during inference each contribute to the performance improvements. Therefore, evaluating the faithfulness of MLLMs' reasoning to visual information is crucial. To address this issue, we first present a cue-driven, automatic, and controllable editing pipeline built with the help of GPT-Image-1. It enables automatic and precise editing of specific visual cues based on an instruction. Furthermore, we introduce VFaith-Bench, the first benchmark to evaluate MLLMs' visual reasoning capabilities and analyze the source of…
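The measurement idea in the abstract (edit a visual cue, then see how performance changes) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the `Sample` record, the `faithfulness_report` function, and the `stale_answer_rate` metric are hypothetical names introduced here. The sketch also assumes that each item is multiple-choice and that editing the cue can change the correct option.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One multiple-choice item scored on an original image and its edited version.

    All field names are illustrative assumptions, not taken from the paper.
    """
    answer_original: str  # correct option for the original image
    answer_edited: str    # correct option after the visual cue is edited
    pred_original: str    # model prediction on the original image
    pred_edited: str      # model prediction on the edited image


def accuracy(pairs):
    """Fraction of (prediction, ground truth) pairs that match."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no samples to score")
    return sum(pred == gold for pred, gold in pairs) / len(pairs)


def faithfulness_report(samples):
    """Compare accuracy before and after cue editing.

    stale_answer_rate (a hypothetical metric) is the share of items whose
    correct answer changed, yet the model still returned the answer that
    was correct for the original image.
    """
    acc_orig = accuracy((s.pred_original, s.answer_original) for s in samples)
    acc_edit = accuracy((s.pred_edited, s.answer_edited) for s in samples)
    changed = [s for s in samples if s.answer_edited != s.answer_original]
    stale = sum(s.pred_edited == s.answer_original for s in changed)
    return {
        "acc_original": acc_orig,
        "acc_edited": acc_edit,
        "accuracy_drop": acc_orig - acc_edit,
        "stale_answer_rate": stale / len(changed) if changed else 0.0,
    }
```

A large `accuracy_drop` together with a high `stale_answer_rate` would be consistent with a model answering from what it remembers about the original image rather than from the edited cue, though this sketch alone cannot establish the cause.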
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. Both the motivation of evaluating MLLMs' faithfulness and the proposed VFaith-Bench make sense and are technically sound to me. Without a full ablation-based benchmark like the one proposed in the draft, it is difficult to determine the faithfulness of the responses from MLLMs. 2. The authors conducted extensive experiments on both open-source and proprietary models and show that they all suffer from a lack of visual faithfulness. 3. The writing is good and easy to follow.
1. My main concern is the heavy dependence on MLLMs themselves in the curation of the benchmark. While this helps automate the pipeline and makes it more scalable, it is a bottleneck, and benchmark quality is capped by the capability of the models used. 2. The benchmark is limited to multiple-choice questions, which are only a small portion of the whole spectrum of evaluations.
The major strengths are as follows: S1. This paper is well written and easy to follow. S2. The studied task is interesting and important in the research field. S3. The authors propose a good baseline for generating data.
The weaknesses are clear from my point of view. W1. On the methodology side: as a machine-generated benchmark, its quality and diversity are limited by the models that generate it. No new knowledge will be created. W2. On the statistics side, some core details are missing, for example the distribution of questions and answers, and comparisons with existing related benchmarks. W3. The difficulty of the benchmark is limited. According to Tab. 2, Gemini-2.5-pro correctly answers 84.78% of the questions.
The paper introduces a unique cue-driven editing pipeline for generating multimodal benchmark data to induce hallucinations and probe reasoning chains. The evaluations revealed deficiencies in visual cue perception and adherence, as well as potential data leakage, providing insights for developing more reliable MLLMs. Unlike prior multimodal reasoning benchmarks (e.g., HallusionBench), this work goes beyond assessing raw reasoning ability — it probes how much of the reasoning is truly grounded
1. The paper does not specify how many visual cues are extracted for each image, which is crucial because the number and granularity of cues directly affect the subsequent image editing process. 2. After large models extract vision cues, there appears to be no human validation or quality control to verify whether these cues are accurate or relevant. Similarly, in Section 4.3, the identification of error reasons (Reason 1–3) is entirely model-driven, without human cross-checking. This lack of hu
1. The paper tackles an important and timely problem: assessing whether MLLMs can faithfully incorporate visual cues into their reasoning process is crucial for advancing multimodal reasoning research. This topic is of clear interest to the ICLR community. 2. VFaith-Bench provides a well-targeted diagnostic benchmark for systematically studying visual faithfulness in MLLMs. 3. The benchmark includes 755 samples spanning six task categories, and the semi-automated image-editing pipeline could be
1. The benchmark’s ability to reveal memorization effects is inherently limited. It can only expose potential memorization if the evaluated models were trained on data overlapping with M3CoT and MegaBench, which together contain around 20k instances. The authors do not verify whether these datasets were part of the training corpora of the evaluated open-source models. If they were not, the benchmark’s capacity to probe faithfulness issues related to memorization would be significantly reduced. 2
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques
