CASTing Your Model: Learning to Localize Improves Self-Supervised Representations
Ramprasaath R. Selvaraju, Karan Desai, Justin Johnson, Nikhil Naik

TL;DR
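This paper introduces CAST (Contrastive Attention-Supervised Tuning), a method that improves self-supervised learning on complex scene images by sampling crops with unsupervised saliency maps and supervising the model's attention with a Grad-CAM loss, yielding better feature representations and greater robustness to background changes.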
Contribution
CAST combines unsupervised saliency maps with a Grad-CAM attention loss to improve SSL on scene images, addressing the poor visual grounding and weak supervisory signal that contrastive SSL methods show on such images.
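To make the idea of an attention loss concrete, here is a minimal sketch that scores how well an attention map agrees with a saliency mask. It is an illustrative stand-in, not the paper's loss: the function name `attention_saliency_loss`, the cosine-based penalty, and the use of a precomputed attention map (rather than Grad-CAM computed from a network) are all assumptions made for this example.

```python
import numpy as np

def attention_saliency_loss(attention, saliency, eps=1e-8):
    """Hypothetical penalty: 1 - cosine similarity between two 2D maps.

    attention: non-negative map, assumed to be precomputed (e.g. by Grad-CAM).
    saliency:  binary or soft saliency mask of the same shape.
    Returns a value near 0 when the maps align and 1 when they are disjoint.
    """
    a = np.asarray(attention, dtype=float).ravel()
    s = np.asarray(saliency, dtype=float).ravel()
    if a.shape != s.shape:
        raise ValueError("attention and saliency must have the same shape")
    # eps guards against division by zero for all-zero maps.
    cos = float(a @ s) / (np.linalg.norm(a) * np.linalg.norm(s) + eps)
    return 1.0 - cos

# Toy example: the salient region is the left half of a 4x4 image.
mask = np.zeros((4, 4))
mask[:, :2] = 1.0
on_object = mask.copy()        # attention exactly on the salient region
on_background = 1.0 - mask     # attention entirely on the background
loss_aligned = attention_saliency_loss(on_object, mask)
loss_disjoint = attention_saliency_loss(on_background, mask)
```

In this toy case the aligned map gets a loss close to 0 and the background-only map gets a loss of 1, which is the behaviour a grounding penalty should have.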
Findings
CAST improves SSL feature quality on COCO scene images.
Models trained with CAST are more robust to background changes.
CAST significantly outperforms baseline SSL methods on complex scenes.
Abstract
Recent advances in self-supervised learning (SSL) have largely closed the gap with supervised ImageNet pretraining. Despite their success, these methods have been primarily applied to unlabeled ImageNet images, and show marginal gains when trained on larger sets of uncurated images. We hypothesize that current SSL methods perform best on iconic images, and struggle on complex scene images with many objects. Analyzing contrastive SSL methods shows that they have poor visual grounding and receive poor supervisory signal when trained on scene images. We propose Contrastive Attention-Supervised Tuning (CAST) to overcome these limitations. CAST uses unsupervised saliency maps to intelligently sample crops, and to provide grounding supervision via a Grad-CAM attention loss. Experiments on COCO show that CAST significantly improves the features learned by SSL methods on scene images, and further…
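The abstract says that saliency maps guide crop sampling but does not give the rule. As a rough sketch of one way this could work, the snippet below rejection-samples random crops until one covers a minimum fraction of the salient area. The function name `sample_salient_crop`, the coverage threshold `min_cover`, and the retry limit `max_tries` are hypothetical choices for illustration and are not taken from the paper.

```python
import numpy as np

def sample_salient_crop(saliency, crop_hw, min_cover=0.2, max_tries=100, rng=None):
    """Sample a crop that overlaps the salient region (illustrative sketch).

    saliency: 2D binary mask (1 = salient pixel).
    crop_hw:  (height, width) of the crop.
    Returns ((top, left), cover), where cover is the fraction of salient
    pixels inside the crop. Falls back to the best crop seen if no sample
    reaches min_cover within max_tries.
    """
    rng = np.random.default_rng() if rng is None else rng
    saliency = np.asarray(saliency, dtype=float)
    H, W = saliency.shape
    ch, cw = crop_hw
    if ch > H or cw > W:
        raise ValueError("crop is larger than the saliency map")
    total = saliency.sum()
    best, best_cover = (0, 0), -1.0
    for _ in range(max_tries):
        top = int(rng.integers(0, H - ch + 1))
        left = int(rng.integers(0, W - cw + 1))
        # With an empty mask there is nothing to constrain, so accept any crop.
        inside = saliency[top:top + ch, left:left + cw].sum()
        cover = float(inside / total) if total > 0 else 1.0
        if cover >= min_cover:
            return (top, left), cover
        if cover > best_cover:
            best, best_cover = (top, left), cover
    return best, best_cover

# Toy example: an 8x8 image whose central 4x4 block is salient.
mask = np.zeros((8, 8))
mask[2:6, 2:6] = 1.0
(top, left), cover = sample_salient_crop(
    mask, (6, 6), min_cover=0.5, rng=np.random.default_rng(0)
)
```

In a contrastive setup, two such crops per image would typically serve as the query and key views, so that both views contain part of the same salient object rather than unrelated background.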
