Perceptio: Perception Enhanced Vision Language Models via Spatial Token Generation
Yuchen Li, Amanmeet Garg, Shalini Chaudhuri, Rui Zhao, Garin Kessler

TL;DR
Perceptio enhances vision language models with explicit 2D and 3D spatial reasoning capabilities by integrating semantic segmentation and depth tokens, leading to improved spatial understanding and state-of-the-art performance on multiple benchmarks.
Contribution
The paper introduces a method for incorporating explicit spatial tokens into large vision language models (LVLMs), enabling better spatial reasoning and grounding.
Findings
Improved spatial understanding accuracy by 10.3%.
Achieved state-of-the-art results on multiple benchmarks.
Demonstrated effectiveness of explicit spatial tokens in LVLMs.
Abstract
Large Vision Language Models (LVLMs) excel at semantic understanding but struggle with fine-grained spatial grounding, as the model must implicitly infer complex geometry without ever producing a spatial interpretation. We present Perceptio, a perception-enhanced LVLM with 2D and 3D spatial reasoning abilities, enabled via explicit semantic segmentation tokens and depth tokens generated directly within the autoregressive sequence. Concretely, we (i) distill a VQ-VAE depth codebook from a strong monocular teacher to tokenize dense depth into compact sequences, and (ii) integrate SAM2-based semantic segmentation tokens and VQ-VAE depth tokens inside the LLM so the model first emits spatial tokens and then answers. To stabilize depth token generation, we introduce novel composite depth-token objectives (marker, token, and count losses) and a soft-merging technique for differentiable…
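The abstract describes tokenizing dense depth with a VQ-VAE codebook and having the model emit spatial tokens before its answer. The following is a minimal sketch of that general idea only, not the authors' implementation: it uses plain nearest-neighbour codebook lookup, and the function names, the `<depth_start>`/`<depth_end>` markers, the `<dN>` token format, and the toy codebook are all hypothetical illustrations.

```python
from typing import List, Sequence


def quantize_depth(patches: Sequence[Sequence[float]],
                   codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each depth patch embedding to the index of its nearest codebook vector.

    Assumption: patches are already encoded as fixed-length vectors; a real
    VQ-VAE would produce them with a learned encoder.
    """
    tokens = []
    for patch in patches:
        # Squared Euclidean distance from this patch to every codebook entry.
        dists = [sum((p - c) ** 2 for p, c in zip(patch, code)) for code in codebook]
        # The token is the index of the closest entry.
        tokens.append(min(range(len(codebook)), key=dists.__getitem__))
    return tokens


def format_target(depth_tokens: List[int], answer: str) -> str:
    """Build a target string with depth tokens first, then the text answer.

    The marker and token spellings here are hypothetical placeholders.
    """
    body = " ".join(f"<d{t}>" for t in depth_tokens)
    return f"<depth_start> {body} <depth_end> {answer}"


# Toy example: a 3-entry codebook and three 2-dimensional patch embeddings.
codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
patches = [[0.1, -0.1], [4.0, 4.5], [0.9, 1.2]]

tokens = quantize_depth(patches, codebook)
print(tokens)  # [0, 2, 1]
print(format_target(tokens, "The cup is closer."))
```

Placing the depth tokens ahead of the answer mirrors the ordering the abstract describes, where the model first emits spatial tokens and then answers.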
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
