Geometry without Position? When Positional Embeddings Help and Hurt Spatial Reasoning
Jian Shi, Michael Birsak, Wenqing Cui, Zhenyu Li, Peter Wonka

TL;DR
This paper investigates the role of positional embeddings in vision transformers and shows that they act as geometric priors that influence spatial reasoning. Experiments on 14 foundation ViT models measure their impact on multi-view geometric consistency.
Contribution
The paper introduces token-level diagnostics and analyzes, across 14 foundation ViT models, how positional embeddings affect geometric structure in ViT representations.
Findings
Positional embeddings serve as geometric priors in ViTs.
They influence multi-view geometric consistency.
Positional embeddings can both help and hinder spatial reasoning.
Abstract
This paper revisits the role of positional embeddings (PEs) within vision transformers (ViTs) from a geometric perspective. We show that PEs are not mere token indices but effectively function as geometric priors that shape the spatial structure of the representation. We introduce token-level diagnostics that measure how multi-view geometric consistency in ViT representations depends on consistent PEs. Through extensive experiments on 14 foundation ViT models, we reveal how PEs influence multi-view geometry and spatial reasoning. Our findings clarify the role of PEs as a causal mechanism that governs spatial structure in ViT representations. Our code is available at https://github.com/shijianjian/vit-geometry-probes
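To make the idea of PEs as geometric priors concrete, here is a toy sketch. It is not the paper's diagnostic and does not use the linked repository. It assumes fixed 2D sine-cosine embeddings (one common PE scheme, not necessarily what the studied models use), random vectors standing in for patch features, and a "second view" simulated by adding small noise. All function and variable names are hypothetical. The sketch compares mean token-wise cosine similarity between two views when both share the same PEs versus when one view's PEs are shuffled.

```python
import numpy as np

def sincos_1d(positions, dim):
    """1D sine-cosine embedding of shape (N, dim); dim must be even."""
    half = dim // 2
    omega = 1.0 / (10000 ** (np.arange(half) / half))   # frequency bands
    angles = positions[:, None] * omega[None, :]        # (N, half)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

def sincos_2d(grid_h, grid_w, dim):
    """2D embedding: half the channels encode the row, half the column."""
    rows, cols = np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij")
    emb_r = sincos_1d(rows.reshape(-1).astype(float), dim // 2)
    emb_c = sincos_1d(cols.reshape(-1).astype(float), dim // 2)
    return np.concatenate([emb_r, emb_c], axis=1)       # (H*W, dim)

def mean_token_cosine(a, b):
    """Mean cosine similarity between corresponding rows of a and b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return float(np.mean(np.sum(a * b, axis=1)))

rng = np.random.default_rng(0)
H = W = 8          # hypothetical 8x8 patch grid
D = 64             # hypothetical embedding width

pe = sincos_2d(H, W, D)

# Assumption: random vectors stand in for patch content; the second view
# sees the same content plus small noise, with known token correspondence.
content_a = rng.normal(size=(H * W, D))
content_b = content_a + 0.1 * rng.normal(size=(H * W, D))

view_a = content_a + pe
view_b_consistent = content_b + pe                            # same PEs
view_b_shuffled = content_b + pe[rng.permutation(H * W)]      # scrambled PEs

consistent_score = mean_token_cosine(view_a, view_b_consistent)
shuffled_score = mean_token_cosine(view_a, view_b_shuffled)

print(f"consistent PEs: {consistent_score:.3f}")
print(f"shuffled PEs:   {shuffled_score:.3f}")
```

In this toy setup the embedding of a token is more similar to that of its grid neighbor than to that of a distant token, which is one sense in which a PE carries geometry rather than a bare index. Scrambling the PEs lowers the cross-view token similarity even though the content is unchanged. A real probe would replace the additive toy model with features extracted from an actual ViT.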
Taxonomy
Topics: Spatial Cognition and Navigation · Action Observation and Synchronization · Visual perception and processing mechanisms
