ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both
Ziyu Guo, Rain Liu, Xinyan Chen, Pheng-Ann Heng

TL;DR
ATLAS is a unified visual reasoning framework in which functional tokens combine the agentic and latent approaches, achieving high performance without complex visual content generation.
Contribution
The paper proposes a novel token-based reasoning framework that merges agentic and latent methods, improving efficiency and generalization in visual reasoning tasks.
Findings
ATLAS outperforms existing methods on challenging benchmarks.
The framework maintains interpretability and compatibility with standard training procedures.
Latent-Anchored GRPO stabilizes training of functional tokens.
Abstract
Visual reasoning, often interleaved with intermediate visual states, has emerged as a promising direction in the field. A straightforward approach is to directly generate images via unified models during reasoning, but this is computationally expensive and architecturally non-trivial. Recent alternatives include agentic reasoning through code or tool calls, and latent reasoning with learnable hidden embeddings. However, agentic methods incur context-switching latency from external execution, while latent methods lack task generalization and are difficult to train with autoregressive parallelization. To combine their strengths while mitigating their limitations, we propose ATLAS, a framework in which a single discrete 'word', termed a functional token, serves both as an agentic operation and a latent visual reasoning unit. Each functional token is associated with an internalized…
