v1: Learning to Point Visual Tokens for Multimodal Grounded Reasoning

Jiwan Chung; Junhyeok Kim; Siyeol Kim; Jaeyoung Lee; Min Soo Kim; Youngjae Yu

arXiv:2505.18842 · cs.CL · May 8, 2026

TL;DR

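The paper introduces v1, a lightweight extension that lets a multimodal language model actively reference the image while it reasons: the model selects relevant image patches and copies their embeddings back into the reasoning stream, which helps it stay focused on the relevant regions.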

Contribution

It proposes a point-and-copy mechanism for visual grounding in multimodal reasoning models, trained on v1g, a dataset of 300K multimodal reasoning traces with interleaved grounding annotations, with the aim of improving interpretability and accuracy.

Findings

01 v1 outperforms baselines on multimodal reasoning benchmarks.

02 The point-and-copy mechanism keeps the visual evidence aligned with the reasoning.

03 Training on the v1g dataset enables the model to learn visual referencing effectively.

Abstract

When thinking with images, humans rarely rely on a single glance: they revisit visual evidence while reasoning. In contrast, most Multimodal Language Models encode an image once into the key-value cache and then reason purely in text, making it hard to re-ground intermediate steps. We empirically confirm this: as reasoning chains lengthen, models progressively lose focus on relevant regions. We introduce v1, a lightweight extension for active visual referencing via point-and-copy: the model selects relevant image patches and copies their embeddings back into the reasoning stream. Crucially, our point-and-copy mechanism retrieves patches using their semantic representations as keys, ensuring perceptual evidence remains aligned with the reasoning space. To train this behavior, we build v1g, a dataset of 300K multimodal reasoning traces with interleaved grounding annotations. Across multimodal…
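To make the point-and-copy idea in the abstract concrete, here is a minimal NumPy sketch of one pointing step. It is an illustration under stated assumptions, not the authors' implementation: the scaled dot-product scoring, the greedy argmax selection, the toy shapes, and the function names `point_to_patch` and `copy_into_stream` are all hypothetical. The only parts taken from the abstract are that patches are retrieved using their semantic representations as keys and that the selected patch's embedding is copied back into the reasoning stream.

```python
import numpy as np


def point_to_patch(query, patch_keys):
    """Score every image patch against the current reasoning state.

    query:      (dim,) hidden state at the current reasoning step (assumed).
    patch_keys: (num_patches, dim) semantic representations used as keys.
    Returns the index of the selected patch and the pointer distribution.
    """
    # Assumption: scaled dot-product scoring, as in standard attention.
    scores = patch_keys @ query / np.sqrt(query.shape[-1])
    scores = scores - scores.max()  # subtract the max for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    # Assumption: greedy selection of the single best-matching patch.
    return int(np.argmax(probs)), probs


def copy_into_stream(stream, patch_embeds, index):
    """Append the selected patch embedding to the reasoning stream."""
    return np.vstack([stream, patch_embeds[index][None, :]])


# Toy setup: 6 patches with orthonormal keys so the result is easy to follow.
rng = np.random.default_rng(0)
patch_keys = np.eye(6, 8)                 # patch i has key e_i
patch_embeds = rng.normal(size=(6, 8))    # embeddings that get copied
stream = rng.normal(size=(4, 8))          # 4 reasoning-step embeddings so far

query = 2.0 * patch_keys[3]               # a state that "asks for" patch 3
index, probs = point_to_patch(query, patch_keys)
stream = copy_into_stream(stream, patch_embeds, index)
```

In this toy run the query lines up with the key of patch 3, so that patch is selected and its embedding becomes the fifth row of `stream`. A real system would interleave such pointing steps with ordinary text decoding; how v1 does that is not specified in the abstract.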

Code & Models

Repositories

jun297/v1
github

Models

kjunh/v1-7B
model · 478 dl

Datasets

kjunh/v1g-sample
dataset · 75 dl
