ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both

Ziyu Guo; Rain Liu; Xinyan Chen; Pheng-Ann Heng

arXiv:2605.15198 · cs.CV · May 15, 2026

TL;DR

ATLAS is a unified visual reasoning framework in which a single functional token serves as both an agentic operation and a latent reasoning unit, achieving strong benchmark performance without complex visual content generation.

Contribution

The paper proposes a novel token-based reasoning framework that merges agentic and latent methods, improving efficiency and generalization in visual reasoning tasks.

Findings

01 ATLAS outperforms existing methods on challenging benchmarks.

02 The framework maintains interpretability and compatibility with standard training procedures.

03 Latent-Anchored GRPO stabilizes training of functional tokens.

Abstract

Visual reasoning, often interleaved with intermediate visual states, has emerged as a promising direction in the field. A straightforward approach is to directly generate images via unified models during reasoning, but this is computationally expensive and architecturally non-trivial. Recent alternatives include agentic reasoning through code or tool calls, and latent reasoning with learnable hidden embeddings. However, agentic methods incur context-switching latency from external execution, while latent methods lack task generalization and are difficult to train with autoregressive parallelization. To combine their strengths while mitigating their limitations, we propose ATLAS, a framework in which a single discrete 'word', termed a functional token, serves both as an agentic operation and a latent visual reasoning unit. Each functional token is associated with an internalized…

Code & Models

Datasets

aoiandroid/papers (dataset, 28 downloads)