Learning to Perceive "Where": Spatial Pretext Tasks for Robust Self-Supervised Learning
Yang Shen, Yusen Cai, Weronika Hryniewska-Guzik, Qing Lin, Mengmi Zhang

TL;DR
This paper introduces Spatial Prediction (SP), a self-supervised pretext task that models part-to-part relationships to strengthen spatial understanding in visual representations, improving performance on image recognition, fine-grained classification, semantic segmentation, and depth estimation.
Contribution
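The paper proposes Spatial Prediction, a spatially aware pretext regression task that predicts the relative position and scale between a pair of local views from the same image. It is implemented as a decoupled plug-in that can be integrated into diverse self-supervised learning frameworks.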
Findings
Consistent improvements in image recognition and segmentation tasks.
Enhanced out-of-distribution robustness for object recognition.
Strong performance on spatial reasoning tasks like patch reordering.
Abstract
Existing self-supervised learning (SSL) methods primarily learn object-invariant representations but often neglect the spatial structure and relationships among object parts. To address this limitation, we introduce Spatial Prediction (SP), a spatially aware pretext regression task that predicts the relative position and scale between a pair of disentangled local views from the same image. By modeling part-to-part relationships in a continuous geometric space, SP encourages representations to capture fine-grained spatial dependencies beyond invariant categorical semantics, thereby learning the compositional structure of visual scenes. SP is implemented as a decoupled plug-in and can be seamlessly integrated into diverse SSL frameworks. Extensive experiments show consistent improvements across image recognition, fine-grained classification, semantic segmentation, and depth estimation, as…
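As a rough illustration of the regression target the abstract describes, the sketch below samples two local views as boxes in normalized image coordinates and computes a relative position and scale vector between them. The parametrization (center offsets divided by the reference view's size, plus log size ratios) is an assumption borrowed from common box-regression encodings, not the paper's stated formulation, and the names Box, sample_local_view, and relative_target are hypothetical.

```python
import math
import random
from typing import NamedTuple, Tuple


class Box(NamedTuple):
    """A local view in normalized image coordinates (all values in [0, 1])."""
    cx: float  # center x
    cy: float  # center y
    w: float   # width
    h: float   # height


def sample_local_view(rng: random.Random,
                      min_size: float = 0.2,
                      max_size: float = 0.5) -> Box:
    """Sample a random crop that lies fully inside the unit square.

    The size range is an illustrative assumption, not a value from the paper.
    """
    w = rng.uniform(min_size, max_size)
    h = rng.uniform(min_size, max_size)
    cx = rng.uniform(w / 2, 1 - w / 2)
    cy = rng.uniform(h / 2, 1 - h / 2)
    return Box(cx, cy, w, h)


def relative_target(a: Box, b: Box) -> Tuple[float, float, float, float]:
    """Continuous target describing where view b sits relative to view a.

    Assumed encoding: center offsets in units of a's size, and log ratios
    of width and height, so identical views map to all zeros.
    """
    dx = (b.cx - a.cx) / a.w
    dy = (b.cy - a.cy) / a.h
    dw = math.log(b.w / a.w)
    dh = math.log(b.h / a.h)
    return (dx, dy, dw, dh)


if __name__ == "__main__":
    rng = random.Random(0)
    view_a = sample_local_view(rng)
    view_b = sample_local_view(rng)
    print(relative_target(view_a, view_b))
```

In a training loop, a small prediction head would take the encoder features of the two crops and regress this four-value target with a standard regression loss. How the actual method pairs views, encodes the target, and weights the loss is not specified in the text available here.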
