VG-CoT: Towards Trustworthy Visual Reasoning via Grounded Chain-of-Thought
Byeonggeuk Lim, Kyeonghyun Kim, JungMin Yun, YoungBin Kim

TL;DR
VG-CoT is an automatically constructed grounded-reasoning dataset for LVLMs. By linking each reasoning step to visual evidence in the image, it aims to make model reasoning more trustworthy and to support evaluation along multiple dimensions.
Contribution
The paper presents a fully automated pipeline for building a grounded reasoning dataset, together with a new benchmark for evaluating the reasoning quality and trustworthiness of LVLMs.
Findings
LVLMs show improved reasoning and trustworthiness on the VG-CoT benchmark.
The dataset enables scalable, cost-effective evaluation of visual reasoning.
Experiments confirm VG-CoT enhances evidence-based reasoning in LVLMs.
Abstract
The advancement of Large Vision-Language Models (LVLMs) requires precise local region-based reasoning that faithfully grounds the model's logic in actual visual evidence. However, existing datasets face limitations in scalability due to extensive manual annotation and lack of explicit alignment between multi-step reasoning and corresponding image regions, which constrains the evaluation of model trustworthiness. To address these challenges, we propose the Visual Grounding Chain-of-Thought (VG-CoT) dataset, which explicitly links each reasoning step to real visual evidence within the image through a fully automated three-stage pipeline. The pipeline first extracts object- and text-level visual evidence using state-of-the-art detection and OCR models, then generates step-by-step grounded reasoning with GPT-4o, and finally refines the grounding through a rationale-driven open-set detection…
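To make "linking each reasoning step to visual evidence" concrete, here is a minimal Python sketch of what a grounded sample could look like. It is not the VG-CoT schema: the class names, field names, box format, and the `grounding_rate` helper are all hypothetical, chosen only to illustrate the idea described in the abstract.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels. Hypothetical.
Box = Tuple[float, float, float, float]


@dataclass
class Evidence:
    """One piece of visual evidence (hypothetical structure)."""
    label: str   # detected object name or recognized text
    box: Box     # image region the evidence comes from
    source: str  # "object" for detection, "ocr" for text


@dataclass
class ReasoningStep:
    """One step of a reasoning chain plus the regions it relies on."""
    text: str
    evidence: List[Evidence] = field(default_factory=list)


@dataclass
class GroundedSample:
    """A question, its step-by-step reasoning, and the final answer."""
    question: str
    steps: List[ReasoningStep]
    answer: str


def ungrounded_steps(sample: GroundedSample) -> List[int]:
    """Return the indices of steps that cite no visual evidence."""
    return [i for i, step in enumerate(sample.steps) if not step.evidence]


def grounding_rate(sample: GroundedSample) -> float:
    """Fraction of steps that cite at least one region (illustrative only)."""
    if not sample.steps:
        return 0.0
    grounded = len(sample.steps) - len(ungrounded_steps(sample))
    return grounded / len(sample.steps)
```

A check like `ungrounded_steps` is one simple way to flag reasoning steps that are not tied to any image region. The paper's own evaluation dimensions are not specified in the text above, so this helper should not be read as one of them.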
