VG-CoT: Towards Trustworthy Visual Reasoning via Grounded Chain-of-Thought

Byeonggeuk Lim; Kyeonghyun Kim; JungMin Yun; YoungBin Kim

arXiv:2604.21396 · cs.CV · April 24, 2026

TL;DR

VG-CoT is an automatically constructed grounded-reasoning dataset for Large Vision-Language Models (LVLMs). By linking each reasoning step to visual evidence in the image, it improves model trustworthiness and supports evaluation across multiple dimensions.

Contribution

The paper presents a fully automated pipeline for building a grounded reasoning dataset, along with a new benchmark for evaluating the reasoning quality and trustworthiness of LVLMs.

Findings

01. LVLMs show improved reasoning and trustworthiness on the VG-CoT benchmark.

02. The dataset enables scalable, cost-effective evaluation of visual reasoning.

03. Experiments confirm VG-CoT enhances evidence-based reasoning in LVLMs.

Abstract

The advancement of Large Vision-Language Models (LVLMs) requires precise local region-based reasoning that faithfully grounds the model's logic in actual visual evidence. However, existing datasets face limitations in scalability due to extensive manual annotation and lack of explicit alignment between multi-step reasoning and corresponding image regions, which constrains the evaluation of model trustworthiness. To address these challenges, we propose the Visual Grounding Chain-of-Thought (VG-CoT) dataset, which explicitly links each reasoning step to real visual evidence within the image through a fully automated three-stage pipeline. The pipeline first extracts object- and text-level visual evidence using state-of-the-art detection and OCR models, then generates step-by-step grounded reasoning with GPT-4o, and finally refines the grounding through a rationale-driven open-set detection…
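
The central idea in the abstract, each reasoning step explicitly linked to visual evidence in the image, can be pictured as a small data structure. The sketch below is illustrative only: the class names, field names, box convention, and the `grounding_rate` helper are assumptions made for this example and are not taken from the paper or from any released data format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed box convention: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class Evidence:
    """One piece of visual evidence: a detected object or an OCR text span."""
    evidence_id: int
    kind: str   # "object" or "text" (hypothetical labels)
    label: str  # detector class name or recognized string
    box: Box


@dataclass
class ReasoningStep:
    """One reasoning step plus the ids of the evidence it cites."""
    text: str
    evidence_ids: List[int] = field(default_factory=list)


@dataclass
class GroundedSample:
    """A question, its answer, the extracted evidence, and the grounded steps."""
    question: str
    answer: str
    evidence: List[Evidence]
    steps: List[ReasoningStep]


def ungrounded_steps(sample: GroundedSample) -> List[int]:
    """Return indices of steps that cite no evidence or cite an unknown id."""
    known = {e.evidence_id for e in sample.evidence}
    return [
        i
        for i, step in enumerate(sample.steps)
        if not step.evidence_ids
        or any(eid not in known for eid in step.evidence_ids)
    ]


def grounding_rate(sample: GroundedSample) -> float:
    """Fraction of steps whose cited evidence all exists in the sample."""
    if not sample.steps:
        return 0.0
    return 1.0 - len(ungrounded_steps(sample)) / len(sample.steps)


# Toy sample with made-up content, used only to exercise the helpers.
sample = GroundedSample(
    question="What should a driver do at this intersection?",
    answer="Stop.",
    evidence=[
        Evidence(0, "object", "stop sign", (120.0, 40.0, 200.0, 120.0)),
        Evidence(1, "text", "STOP", (135.0, 65.0, 185.0, 95.0)),
    ],
    steps=[
        ReasoningStep("There is a red octagonal sign.", [0]),
        ReasoningStep("The sign reads STOP.", [1]),
        ReasoningStep("Therefore the driver should stop."),  # cites nothing
    ],
)

if __name__ == "__main__":
    print(ungrounded_steps(sample))
    print(round(grounding_rate(sample), 3))
```

A check like `ungrounded_steps` is one simple way to flag reasoning steps that have no supporting region; the paper's own refinement and evaluation procedures may differ.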
