Allocentric Perceiver: Disentangling Allocentric Reasoning from Egocentric Visual Priors via Frame Instantiation

Hengyi Wang; Ruiqiang Zhang; Chang Liu; Guanjie Wang; Zehua Ma; Han Fang; Weiming Zhang

arXiv:2602.05789·cs.CV·February 6, 2026

Open Access

TL;DR

Allocentric Perceiver is a training-free method that improves spatial reasoning in vision-language models by explicitly reconstructing 3D geometry and aligning reference frames, leading to significant gains on allocentric tasks.

Contribution

It introduces a novel, training-free approach that disentangles allocentric reasoning from egocentric priors by using geometric experts and frame instantiation.

Findings

01. Achieves ~10% improvement on allocentric spatial reasoning benchmarks.

02. Maintains strong egocentric performance.

03. Outperforms existing spatial perception models.

Abstract

With the rising need for spatially grounded tasks such as Vision-Language Navigation/Action, allocentric perception capabilities in Vision-Language Models (VLMs) are receiving growing focus. However, VLMs remain brittle on allocentric spatial queries that require explicit perspective shifts, where the answer depends on reasoning in a target-centric frame rather than the observed camera view. Thus, we introduce Allocentric Perceiver, a training-free strategy that recovers metric 3D states from one or more images with off-the-shelf geometric experts, and then instantiates a query-conditioned allocentric reference frame aligned with the instruction's semantic intent. By deterministically transforming reconstructed geometry into the target frame and prompting the backbone VLM with structured, geometry-grounded representations, Allocentric Perceiver offloads mental rotation from implicit…
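
The frame transformation the abstract describes can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (allocentric_basis, to_allocentric, describe) are hypothetical, the camera convention (x right, y down, z forward) is an assumption, and the 3D positions and facing direction are taken as given rather than produced by geometric experts. It only shows how re-expressing camera-frame points in a frame attached to a reference object turns a perspective shift into a deterministic change of basis.

```python
import numpy as np


def allocentric_basis(forward_cam, up_cam=(0.0, -1.0, 0.0)):
    """Rows are the right, up, forward axes of the target frame, in camera coordinates.

    Assumes a right-handed camera frame with x right, y down, z forward,
    so the default "up" direction is -y. Both conventions are assumptions.
    """
    f = np.asarray(forward_cam, dtype=float)
    f = f / np.linalg.norm(f)
    u = np.asarray(up_cam, dtype=float)
    u = u - np.dot(u, f) * f  # make up orthogonal to forward
    u = u / np.linalg.norm(u)
    r = np.cross(f, u)  # right = forward x up in a right-handed frame
    return np.stack([r, u, f])


def to_allocentric(points_cam, origin_cam, forward_cam, up_cam=(0.0, -1.0, 0.0)):
    """Express camera-frame points in the frame centered on the reference object."""
    basis = allocentric_basis(forward_cam, up_cam)
    offset = np.asarray(points_cam, dtype=float) - np.asarray(origin_cam, dtype=float)
    return offset @ basis.T  # columns: right, up, forward


def describe(point_alloc):
    """Turn target-frame coordinates into coarse relation labels."""
    right, _, forward = point_alloc
    side = "right" if right > 0 else "left"
    depth = "in front of" if forward > 0 else "behind"
    return side, depth


# Hypothetical scene: a person 3 m from the camera, facing it; a cup to the
# camera's right and slightly closer to the camera than the person.
person_position = [0.0, 0.0, 3.0]
person_facing = [0.0, 0.0, -1.0]
cup_position = [1.0, 0.0, 2.0]

cup_in_person_frame = to_allocentric(cup_position, person_position, person_facing)
relation = describe(cup_in_person_frame)
```

In this toy scene the cup appears on the right from the camera's viewpoint, yet the transformed coordinates place it on the person's left and in front of the person. Coordinates or labels of this kind are one plausible form the structured, geometry-grounded representation passed to the VLM could take.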

Taxonomy

Topics: Multimodal Machine Learning Applications · Spatial Cognition and Navigation · Constraint Satisfaction and Optimization