Seeing the Whole in the Parts in Self-Supervised Representation Learning

Arthur Aubret; Céline Teulière; Jochen Triesch

arXiv:2501.02860·cs.LG·January 7, 2025

Open Access·3 Reviews

TL;DR

This paper introduces CO-SSL, a novel self-supervised learning method that aligns local and global image representations, leading to improved accuracy and robustness over previous approaches, and providing insights into representation redundancy.

Contribution

The paper proposes CO-SSL, a new approach that aligns local and global features in self-supervised learning, outperforming prior methods and enhancing robustness.

Findings

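01. Achieves 71.5% Top-1 accuracy on ImageNet-1K with 100 pre-training epochs

02. More robust to noise corruption, internal corruption, small adversarial attacks, and large training crop sizes

03. Learns highly redundant local representations, which offers an explanation for its robustness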

Abstract

Recent successes in self-supervised learning (SSL) model spatial co-occurrences of visual features either by masking portions of an image or by aggressively cropping it. Here, we propose a new way to model spatial co-occurrences by aligning local representations (before pooling) with a global image representation. We present CO-SSL, a family of instance discrimination methods and show that it outperforms previous methods on several datasets, including ImageNet-1K where it achieves 71.5% of Top-1 accuracy with 100 pre-training epochs. CO-SSL is also more robust to noise corruption, internal corruption, small adversarial attacks, and large training crop sizes. Our analysis further indicates that CO-SSL learns highly redundant local representations, which offers an explanation for its robustness. Overall, our work suggests that aligning local and global representations may be a powerful…
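The core idea in the abstract, aligning local representations (before pooling) with a global image representation, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names are hypothetical, the global representation is taken to be an average-pooled feature vector, and the loss is a BYOL-style cosine distance averaged over spatial locations. The paper's actual heads, targets, and loss may differ.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def global_average_pool(local_feats):
    """Average the per-location feature vectors into one global vector."""
    n = len(local_feats)
    dim = len(local_feats[0])
    return [sum(f[d] for f in local_feats) / n for d in range(dim)]


def local_global_alignment_loss(local_feats, global_feat):
    """Mean BYOL-style distance (2 - 2 * cosine) between each local
    feature vector and a single global representation.

    Assumption: local_feats comes from one view before pooling, and
    global_feat from another view of the same image after pooling.
    """
    g = l2_normalize(global_feat)
    total = 0.0
    for feat in local_feats:
        z = l2_normalize(feat)
        cosine = sum(a * b for a, b in zip(z, g))
        total += 2.0 - 2.0 * cosine
    return total / len(local_feats)


# Hypothetical 2x2 feature maps (4 locations, 3 channels) for two views.
view_a_local = [[1.0, 0.2, 0.0], [0.9, 0.1, 0.1], [1.1, 0.0, 0.2], [0.8, 0.3, 0.0]]
view_b_local = [[1.0, 0.1, 0.1], [1.0, 0.2, 0.0], [0.9, 0.0, 0.1], [1.1, 0.1, 0.0]]

target = global_average_pool(view_b_local)  # global representation of view B
loss = local_global_alignment_loss(view_a_local, target)
print(f"local-to-global alignment loss: {loss:.4f}")
```

Under these assumptions, the loss is 0 when every local vector points in the same direction as the global one and grows toward 4 as they point in opposite directions. Minimizing it pushes each location toward predicting the whole image, which is consistent with the redundancy of local representations reported in the abstract.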

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01·Rating 5·Confidence 3

Strengths

1. The idea of co-occurrence is important in SSL.
2. The title is clear and intuitive.

Weaknesses

1. Spatial co-occurrence has been identified as useful before. E.g., in SimCLR, cropping has shown its importance. [1] further proposes SSL training on patches. The proposed method should be compared with such methods and include a better analysis of why the proposed multiple heads are better, in addition to the short empirical study in Fig. 3.
2. The experimental improvement in Table 1 is also limited, which is a) related to point 1; b) potentially indicates that the original design (e.g. BYOL, SimCLR) im

Reviewer 02·Rating 6·Confidence 4

Strengths

1. The analytical experiments (robustness, similarities, saliencies) are insightful. The visualizations in Figure 4 show that CO-SSL works well.
2. The work has promising extensions and can probably shed light on modifying ViTs to enhance local focus and processing.
3. This paper is trying to make solid contributions, as it proposes not only new objectives but also a new architecture. Thanks for the hard work.

Weaknesses

1. Referring to Table 2, why does applying CO-SSL to DINO not improve the baseline method as effectively as it does for BYOL? This also makes me curious about the effectiveness of CO-SSL on other similar work with two models, e.g., Barlow Twins. I expect CO-SSL to generalize to other SSL methods as well; otherwise the contribution is narrowed.
2. Following (1), while the authors claim that "in principle, the approach is adaptable to most SSL methods," we don't know its effectiveness and performance. In

Reviewer 03·Rating 3·Confidence 3

Strengths

1. This work proposes a novel contrastive learning approach, termed CO-SSL, to encourage spatial co-occurrences. The proposed method demonstrates efficiency in pretraining on ImageNet.
2. It is proved that CO-SSL learns stronger local similarities, regardless of the neural architecture.

Weaknesses

**The original contribution compared to BYOL with multi-crops is limited:** The loss function between local and global embeddings can be treated as BYOL with local crops introduced.

**Conclusion on RF and Crop Size Relationship:** The conclusion drawn about the relationship between receptive field size and crop size is currently not well-supported. Figure 3(a) shows a general trend but does not clearly demonstrate an inverse correlation. The authors need to strengthen their argument by providing

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning