VGR: Visual Grounded Reasoning

Jiacong Wang; Zijian Kang; Haochen Wang; Haiyong Jiang; Jiawen Li; Bohong Wu; Ya Wang; Jiao Ran; Xiao Liang; Chao Feng; Jun Xiao

arXiv:2506.11991·cs.CV·May 4, 2026

TL;DR

VGR is a multimodal reasoning model that enhances visual perception by grounding reasoning in image regions, improving performance on complex visual tasks with fewer image tokens.

Contribution

Introduces VGR, a multimodal large language model with fine-grained visual grounding, together with VGR-SFT, a new dataset for multimodal reasoning tasks.

Findings

01

VGR outperforms baselines on multiple visual reasoning benchmarks.

02

VGR achieves higher scores than the baseline while using only 30% of the image tokens.

03

The model integrates detected image regions into its reasoning, answering from replayed regions rather than from language alone.

Abstract

In multimodal chain-of-thought (CoT) reasoning, existing approaches predominantly reason in a pure language space, which inherently suffers from language bias and is largely confined to math or science domains. This narrow focus limits their ability to handle complex visual reasoning tasks that demand a comprehensive understanding of image details. To address these limitations, this paper introduces VGR, a novel reasoning multimodal large language model (MLLM) with enhanced fine-grained visual perception capabilities. Unlike traditional MLLMs that answer questions or reason solely in the language space, VGR first detects relevant regions that may help solve the problem, and then provides precise answers based on the replayed image regions. To achieve this, we construct a large-scale SFT dataset called VGR-SFT that contains reasoning data with mixed vision…
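The detect-then-replay flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the `<region>[x1, y1, x2, y2]</region>` marker syntax, the function names `find_regions`, `crop`, and `replay_regions`, and the nested-list image representation are all assumptions made for illustration. The sketch only shows the general idea of pulling region references out of a reasoning trace and fetching the matching image crops so they can be fed back to the model.

```python
import re
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels, end-exclusive

# Hypothetical marker format; the real region syntax used by VGR is not
# specified on this page.
REGION_PATTERN = re.compile(
    r"<region>\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]</region>"
)


def find_regions(reasoning_text: str) -> List[Box]:
    """Return every box referenced in the reasoning text, in order."""
    return [
        tuple(int(v) for v in match.groups())
        for match in REGION_PATTERN.finditer(reasoning_text)
    ]


def crop(image: List[List[int]], box: Box) -> List[List[int]]:
    """Cut a box out of an image stored as rows of pixel values."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]


def replay_regions(image: List[List[int]], reasoning_text: str):
    """Pair each referenced box with its crop (the 'replayed' region)."""
    return [(box, crop(image, box)) for box in find_regions(reasoning_text)]
```

In a real system the crops would be encoded into image tokens and appended to the model's context before it continues reasoning; here they are simply returned so the control flow is visible.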

Code & Models

Models

BytedanceDouyinContent/VGR
model · ♡ 4

Datasets

BytedanceDouyinContent/VGR
dataset · 344 downloads

Videos

VGR: Visual Grounded Reasoning · SlidesLive