From Edges to Depth: Probing the Spatial Hierarchy in Vision Transformers
Jainum Sanghavi

TL;DR
This paper investigates how Vision Transformers encode spatial information and finds a hierarchy: local boundaries become decodable at earlier layers than depth, resembling processing in the primate visual cortex.
Contribution
The paper shows that a Vision Transformer trained only on image classification, with no explicit spatial supervision, encodes a layered spatial hierarchy.
Findings
Boundary structure (BSDS500) becomes linearly decodable at layers 5-6 (AP = 0.833).
Depth decoding (NYU Depth V2) peaks at layer 8 (MAE = 0.0875).
Both signals collapse at the final classification layer.
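The layerwise findings above come from fitting a separate probe on each layer's frozen activations and comparing scores across layers. The following is a minimal sketch of that procedure, not the paper's code: the ridge probe, the train/test split, and the synthetic "activations" (random features with a target mixed in at a hypothetical strength per layer) are all assumptions made for illustration. In practice the arrays would hold per-patch token features from a frozen ViT-B/16.

```python
import numpy as np

def fit_ridge_probe(X, y, l2=1e-3):
    """Closed-form ridge regression; the last weight is the bias term."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    A = Xb.T @ Xb + l2 * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ y)

def probe_mae(X, y, w):
    """Mean absolute error of a fitted probe on held-out data."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return float(np.mean(np.abs(Xb @ w - y)))

def layerwise_probe(acts_by_layer, y, n_train):
    """Fit one probe per layer on the first n_train rows, score on the rest."""
    scores = []
    for X in acts_by_layer:
        w = fit_ridge_probe(X[:n_train], y[:n_train])
        scores.append(probe_mae(X[n_train:], y[n_train:], w))
    return scores

# Synthetic stand-in for frozen activations (assumption, not real ViT features).
rng = np.random.default_rng(0)
n, d = 400, 16
y = rng.uniform(0.0, 1.0, size=n)          # hypothetical per-patch target
direction = rng.normal(size=d)             # direction that carries the target
signal_strength = [0.0, 0.5, 2.0, 0.0]     # hypothetical: signal peaks mid-network
acts = [rng.normal(size=(n, d)) + s * np.outer(y, direction)
        for s in signal_strength]

scores = layerwise_probe(acts, y, n_train=300)
best_layer = int(np.argmin(scores))        # layer with the lowest probe MAE
print(scores, best_layer)
```

In this toy setup the lowest error lands on the layer where the target was mixed in most strongly, and the last "layer" (no signal) behaves like the collapse described for the classification layer. A boundary probe would use a classifier and average precision instead of MAE.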
Abstract
Vision Transformers trained only on image classification routinely transfer to tasks that demand spatial understanding, yet they receive no spatial supervision during pretraining. We ask where and how robustly such structure is encoded. Probing a frozen ViT-B/16 layerwise for two complementary properties, local patch boundaries (BSDS500) and per-patch depth (NYU Depth V2), reveals a clear hierarchy: boundary structure becomes linearly decodable at layers 5-6 (AP = 0.833), while depth, which requires integrating global cues, peaks two to three layers later at layer 8 (MAE = 0.0875). Both signals collapse at the final classification layer, and random-weight controls confirm the encodings are learned rather than architectural. Causal interventions add specificity: ablating the single direction a linear depth probe reads degrades depth decoding by up to 165%, while ablating any other…
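The causal intervention mentioned in the abstract, ablating the direction a linear depth probe reads, can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the paper's implementation: the data are synthetic, the probe is a ridge regressor, the ablation is a mean-ablation (the component along the probe direction is replaced by its training mean), and the frozen probe is reused after ablation rather than retrained. All function names are hypothetical.

```python
import numpy as np

def fit_linear_probe(X, y, l2=1e-3):
    """Ridge probe; returns (weights, bias)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    theta = np.linalg.solve(Xb.T @ Xb + l2 * np.eye(Xb.shape[1]), Xb.T @ y)
    return theta[:-1], theta[-1]

def ablate_direction(X, u, mean):
    """Remove variation along u by projecting centered features off u."""
    u = u / np.linalg.norm(u)
    coeff = (X - mean) @ u
    return X - np.outer(coeff, u)

def mae(X, y, w, b):
    return float(np.mean(np.abs(X @ w + b - y)))

# Synthetic stand-in for one layer's activations (assumption).
rng = np.random.default_rng(0)
n, d = 400, 16
depth = rng.uniform(0.0, 1.0, size=n)
encode = rng.normal(size=d)
X = rng.normal(size=(n, d)) + 2.0 * np.outer(depth, encode)
X_tr, y_tr, X_te, y_te = X[:300], depth[:300], X[300:], depth[300:]

w, b = fit_linear_probe(X_tr, y_tr)
mu = X_tr.mean(axis=0)

base = mae(X_te, y_te, w, b)
X_abl = ablate_direction(X_te, w, mu)      # ablate the probe's own direction
ablated = mae(X_abl, y_te, w, b)
rel_increase = 100.0 * (ablated - base) / base
print(f"MAE {base:.3f} -> {ablated:.3f} ({rel_increase:.0f}% increase)")
```

After ablation the frozen probe sees the same value along its read-out direction for every input, so its prediction collapses to a constant and the error rises. The relative increase printed here is a property of the toy data only and should not be compared to the figure reported in the abstract.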
