Learning to Perceive "Where": Spatial Pretext Tasks for Robust Self-Supervised Learning

Yang Shen; Yusen Cai; Weronika Hryniewska-Guzik; Qing Lin; Mengmi Zhang

arXiv:2605.09963 · cs.CV · May 12, 2026

TL;DR

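This paper introduces Spatial Prediction (SP), a self-supervised pretext task that predicts the relative position and scale between two local views of the same image, so that the learned representations capture part-to-part spatial relationships. The authors report improvements on image recognition, fine-grained classification, semantic segmentation, and depth estimation.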

Contribution

The paper proposes Spatial Prediction, a spatially aware regression pretext task that models part-to-part relationships in a continuous geometric space. It is implemented as a decoupled plug-in that can be integrated into existing self-supervised learning frameworks.

Findings

01. Consistent improvements in image recognition and segmentation tasks.

02. Enhanced out-of-distribution robustness for object recognition.

03. Strong performance on spatial reasoning tasks like patch reordering.

Abstract

Existing self-supervised learning (SSL) methods primarily learn object-invariant representations but often neglect the spatial structure and relationships among object parts. To address this limitation, we introduce Spatial Prediction (SP), a spatially aware pretext regression task that predicts the relative position and scale between a pair of disentangled local views from the same image. By modeling part-to-part relationships in a continuous geometric space, SP encourages representations to capture fine-grained spatial dependencies beyond invariant categorical semantics, thereby learning the compositional structure of visual scenes. SP is implemented as a decoupled plug-in and can be seamlessly integrated into diverse SSL frameworks. Extensive experiments show consistent improvements across image recognition, fine-grained classification, semantic segmentation, and depth estimation, as…
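
To make the regression target concrete, the sketch below shows one plausible way to compute a "relative position and scale" label for a pair of local views. It is an illustration under stated assumptions, not the paper's definition: the abstract does not specify the parameterization, so the center-offset and log-scale-ratio encoding used here is borrowed from common bounding-box regression practice. The names `Box`, `sample_box`, and `relative_target` are hypothetical.

```python
import math
import random
from typing import NamedTuple, Tuple


class Box(NamedTuple):
    """A crop in normalized image coordinates (all values in [0, 1])."""
    cx: float  # center x
    cy: float  # center y
    w: float   # width
    h: float   # height


def sample_box(rng: random.Random,
               min_size: float = 0.2,
               max_size: float = 0.6) -> Box:
    """Sample a random local view that lies fully inside the image.

    The size range is an arbitrary choice for illustration.
    """
    w = rng.uniform(min_size, max_size)
    h = rng.uniform(min_size, max_size)
    cx = rng.uniform(w / 2, 1 - w / 2)
    cy = rng.uniform(h / 2, 1 - h / 2)
    return Box(cx, cy, w, h)


def relative_target(a: Box, b: Box) -> Tuple[float, float, float, float]:
    """Regression label describing view b relative to view a.

    ASSUMED parameterization (not taken from the paper):
      - position: center offset of b from a, in units of a's size
      - scale: log ratio of b's size to a's size
    """
    dx = (b.cx - a.cx) / a.w
    dy = (b.cy - a.cy) / a.h
    dlog_w = math.log(b.w / a.w)
    dlog_h = math.log(b.h / a.h)
    return (dx, dy, dlog_w, dlog_h)
```

For example, with `a = Box(0.25, 0.25, 0.5, 0.5)` and `b = Box(0.5, 0.5, 0.25, 0.25)`, the position components are both 0.5 and the scale components are both log(0.5). In a training loop, a small prediction head would presumably take the embeddings of the two views and regress this 4-vector. The continuous label is what distinguishes this kind of task from classification-style spatial pretext tasks.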
