Perceptio: Perception Enhanced Vision Language Models via Spatial Token Generation

Yuchen Li; Amanmeet Garg; Shalini Chaudhuri; Rui Zhao; Garin Kessler

arXiv:2603.18795·cs.CV·March 20, 2026

Open Access

TL;DR

Perceptio enhances vision language models with explicit 2D and 3D spatial reasoning capabilities by integrating semantic segmentation and depth tokens, leading to improved spatial understanding and state-of-the-art performance on multiple benchmarks.

Contribution

The paper introduces a method for incorporating explicit spatial tokens into large vision language models (LVLMs), enabling better spatial reasoning and grounding.

Findings

01 Improved spatial understanding accuracy by 10.3%.

02 Achieved state-of-the-art results on multiple benchmarks.

03 Demonstrated effectiveness of explicit spatial tokens in LVLMs.

Abstract

Large Vision Language Models (LVLMs) excel at semantic understanding but struggle with fine-grained spatial grounding, as the model must implicitly infer complex geometry without ever producing a spatial interpretation. We present Perceptio, a perception-enhanced LVLM with 2D and 3D spatial reasoning abilities, enabled via explicit semantic segmentation tokens and depth tokens generated directly within the autoregressive sequence. Concretely, we (i) distill a VQ-VAE depth codebook from a strong monocular teacher to tokenize dense depth into compact sequences, and (ii) integrate SAM2-based semantic segmentation tokens and VQ-VAE depth tokens inside the LLM so the model first emits spatial tokens and then answers. To stabilize depth token generation, we introduce novel composite depth-token objectives (marker, token, and count losses) and a soft-merging technique for differentiable…
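
As a rough illustration of the depth tokenization step described in the abstract, the sketch below shows the nearest-neighbor lookup that a vector-quantized codebook performs, followed by wrapping the resulting indices in start and end markers. It is a toy example under stated assumptions, not the paper's implementation: the codebook values, the two-dimensional patch vectors, and the token names `<depth_start>`, `<depth_N>`, and `<depth_end>` are all hypothetical, and a real VQ-VAE learns its codebook and encoder rather than using hand-written values.

```python
from typing import List, Sequence


def quantize(patches: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each patch vector to the index of its nearest codebook entry.

    Distance is squared Euclidean, the usual choice for VQ lookups.
    """
    indices = []
    for patch in patches:
        distances = [
            sum((p - c) ** 2 for p, c in zip(patch, code))
            for code in codebook
        ]
        indices.append(distances.index(min(distances)))
    return indices


def to_depth_tokens(indices: Sequence[int]) -> List[str]:
    """Wrap codebook indices in hypothetical marker tokens.

    The marker and token names are assumptions made for this sketch.
    """
    return ["<depth_start>"] + [f"<depth_{i}>" for i in indices] + ["<depth_end>"]


# Toy codebook with three entries (hypothetical values).
codebook = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
# Toy "encoded depth patches" (hypothetical values).
patches = [[0.1, 0.0], [0.9, 1.0], [0.4, 0.6]]

indices = quantize(patches, codebook)
tokens = to_depth_tokens(indices)
print(tokens)
```

In this toy setup, a dense depth map becomes a short sequence of discrete tokens that a language model could emit before its answer. The marker tokens give a sequence-level structure that objectives such as the marker and count losses named in the abstract could plausibly supervise, though the abstract does not spell out their form.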

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning