Long Grounded Thoughts: Synthesizing Visual Problems and Reasoning Chains at Scale

David Acuna; Chao-Han Huck Yang; Yuntian Deng; Jaehun Jung; Ximing Lu; Prithviraj Ammanabrolu; Hyunwoo Kim; Yuan-Hong Liao; Yejin Choi

arXiv:2511.05705 · cs.CV · February 18, 2026

TL;DR

This paper introduces a large-scale vision-centric dataset and synthesis framework for complex visual reasoning problems, significantly improving multimodal reasoning models' performance across various benchmarks and modalities.

Contribution

A two-stage synthesis framework that produces over 1 million diverse visual problems, used to improve vision-centric reasoning models and to enable cross-modality transfer.

Findings

01 Finetuning Qwen2.5-VL-7B on the dataset outperforms existing open-data models.

02 The dataset improves reasoning in text-only and audio modalities.

03 High-quality data with reasoning traces is crucial for scaling online RL.

Abstract

Despite rapid progress, multimodal reasoning still lacks a systematic approach to synthesize large-scale vision-centric datasets beyond visual math. We introduce a framework able to synthesize vision-centric problems spanning diverse levels of complexity, and the resulting dataset with over 1M high-quality problems including: reasoning traces, preference data, and instruction prompts supporting SFT, offline and online RL. Our vision-centric synthesis framework uses a two-stage process focusing on: (1) generating diverse verifiable questions from existing images at scale, and (2) creating complex compositional visual problems by merging simpler questions. Remarkably, finetuning Qwen2.5-VL-7B on our data outperforms existing open-data baselines across evaluated vision-centric benchmarks, and our best configurations match or surpass strong closed-data models such as MiMo-VL-7B-RL on Vstar…
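The two-stage idea in the abstract (simple verifiable questions first, then compositional problems built by merging them) can be illustrated with a toy sketch. This is not the authors' pipeline: the `Question` type, the use of per-image object labels as input, the counting-question template, and the "sum of counts" composition rule are all hypothetical choices made here only to show the shape of such a process.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Question:
    text: str
    answer: str  # ground-truth answer that an automatic checker can compare against


def stage1_simple_questions(object_labels: List[str]) -> List[Question]:
    """Hypothetical stage 1: turn per-image object labels into counting questions."""
    counts = Counter(object_labels)
    return [
        Question(f"How many instances of '{label}' are in the image?", str(n))
        for label, n in sorted(counts.items())
    ]


def stage2_compose(a: Question, b: Question) -> Question:
    """Hypothetical stage 2: merge two counting questions into one that needs both."""
    total = int(a.answer) + int(b.answer)
    text = (
        "Answer both sub-questions, then report the sum of the two counts. "
        f"(i) {a.text} (ii) {b.text}"
    )
    return Question(text, str(total))


def is_correct(question: Question, prediction: str) -> bool:
    """Exact-match check; an assumed stand-in for whatever verifier a real pipeline uses."""
    return prediction.strip() == question.answer


# Toy annotations for a single image (made up for illustration).
labels = ["dog", "dog", "ball", "person"]
simple = stage1_simple_questions(labels)
composed = stage2_compose(simple[0], simple[1])
```

Because every composed question carries a ground-truth answer derived from its parts, it stays automatically checkable, which is the property the abstract refers to as "verifiable". A real system would start from images rather than label lists and would use far richer question types.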

Code & Models

Datasets

nvidia/nemotron-research-lgt
dataset · 69 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications