CoRGI: Verified Chain-of-Thought Reasoning with Post-hoc Visual Grounding
Shixin Yi, Lin Shang

TL;DR
CoRGI enhances multimodal reasoning by verifying and grounding chain-of-thought explanations in visual evidence, significantly reducing hallucinations and improving answer accuracy and interpretability across multiple benchmarks and models.
Contribution
This paper introduces CoRGI, a post-hoc verification framework that grounds each reasoning step in visual evidence to improve the trustworthiness of vision-language models.
Findings
Improves answer accuracy across five benchmarks.
Reduces hallucinations and unsupported claims.
Enhances interpretability and trustworthiness.
Abstract
Multimodal reasoning with vision-language models (VLMs) often suffers from hallucinations, as models tend to generate explanations after only a superficial inspection of the image. We present CoRGI (Chain of Reasoning with Grounded Insights), a framework that enhances reasoning reliability through post-hoc verification of chain-of-thought outputs. Given a VLM-generated rationale, CoRGI decomposes it into step-wise statements, grounds each step in visual evidence, and filters or corrects unsupported claims before producing the final answer. Experiments on five challenging benchmarks (VCR, ScienceQA, MMMU, MathVista, and HallusionBench) demonstrate that CoRGI consistently improves both answer accuracy and explanation faithfulness across multiple VLM backbones, including Qwen-2.5VL, LLaVA-1.6, and Gemma3-12B. Beyond quantitative gains,…
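The decompose, ground, and filter steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `split_rationale`, `verify_rationale`, `keep_supported`, and `toy_ground` are hypothetical, the sentence splitter is a naive stand-in for whatever decomposition CoRGI uses, and the keyword-matching grounder stands in for a real visual grounding model. The correction of unsupported claims is not modeled; unsupported steps are simply dropped.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class StepVerdict:
    step: str                 # one statement from the rationale
    supported: bool           # True if the grounder found evidence
    evidence: Optional[str]   # hypothetical: description of the supporting region


def split_rationale(rationale: str) -> List[str]:
    """Naive sentence split (assumption: one reasoning step per sentence)."""
    parts = re.split(r"(?<=[.!?])\s+", rationale.strip())
    return [p for p in parts if p]


def verify_rationale(
    rationale: str, ground: Callable[[str], Optional[str]]
) -> List[StepVerdict]:
    """Ask the grounder for evidence for each step and record the verdict."""
    verdicts = []
    for step in split_rationale(rationale):
        evidence = ground(step)
        verdicts.append(StepVerdict(step, evidence is not None, evidence))
    return verdicts


def keep_supported(verdicts: List[StepVerdict]) -> List[str]:
    """Filter out steps for which no visual evidence was found."""
    return [v.step for v in verdicts if v.supported]


def toy_ground(step: str) -> Optional[str]:
    """Toy grounder: keyword lookup against a fixed list of 'visible' objects.
    A real system would query a vision model with the image instead."""
    visible = {"dog": "region 1: a dog on grass", "ball": "region 2: a red ball"}
    for keyword, evidence in visible.items():
        if keyword in step.lower():
            return evidence
    return None


rationale = (
    "The dog is chasing a ball. "
    "A cat is watching from the fence. "
    "So the dog is playing."
)
verdicts = verify_rationale(rationale, toy_ground)
kept = keep_supported(verdicts)
for v in verdicts:
    print(("KEEP " if v.supported else "DROP ") + v.step)
```

In this toy run the statement about the cat has no matching evidence and is filtered out, while the two statements about the dog are kept and would be passed on to the final answer stage.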
Taxonomy
Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Generative Adversarial Networks and Image Synthesis
