Self-Supervised Learning with a Multi-Task Latent Space Objective
Pierre-François De Plaen, Abhishek Jha, Luc Van Gool, Tinne Tuytelaars, Marc Proesmans

TL;DR
This paper introduces a multi-task, multi-view self-supervised learning framework that stabilizes multi-crop training by assigning a separate predictor to each view type, and combines global, local, and masked views to improve image representations.
Contribution
The paper proposes a multi-task Siamese SSL method that stabilizes multi-crop training and integrates multiple view types to improve visual representation learning.
Findings
Stable training across backbones like ResNet and ViT.
Consistent performance improvements on ImageNet.
Effective integration of global, local, and masked views.
Abstract
Self-supervised learning (SSL) methods based on Siamese networks learn visual representations by aligning different views of the same image. The multi-crop strategy, which adds small local crops to the global ones, enhances many SSL frameworks but causes instability in predictor-based architectures such as BYOL, SimSiam, and MoCo v3. We trace this failure to the shared predictor used across all views and demonstrate that assigning a separate predictor to each view type stabilizes multi-crop training, resulting in significant performance gains. Extending this idea, we treat each spatial transformation as a distinct alignment task and add cutout views, where part of the image is masked before encoding. This yields a simple multi-task formulation of asymmetric Siamese SSL that combines global, local, and masked views into a single framework. The approach is stable, generally applicable…
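The core idea in the abstract, one predictor per view type instead of a single shared predictor, can be illustrated with a small forward-pass sketch. This is not the authors' implementation. It assumes embeddings have already been produced by an encoder, uses a single linear map as a stand-in for each predictor, uses negative cosine similarity as the alignment loss (as in BYOL and SimSiam), and sums the per-task losses with equal weight. The names `VIEW_TYPES`, `make_predictor`, and `multi_task_loss` are hypothetical, and the abstract does not specify the task weighting.

```python
import math
import random

# Hypothetical view-type names, following the abstract's global/local/masked split.
VIEW_TYPES = ("global", "local", "masked")


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def linear(weight, x):
    """Apply a weight matrix (list of rows) to a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def make_predictor(dim, rng):
    """Random linear map standing in for a predictor head (assumption: real
    predictors are usually small MLPs)."""
    scale = 1.0 / math.sqrt(dim)
    return [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(dim)]


def multi_task_loss(online_embeddings, target_embedding, predictors):
    """Sum of per-view-type alignment losses.

    online_embeddings: dict mapping view type -> list of online-branch embeddings
    target_embedding:  target-branch embedding, treated as a constant
                       (the stop-gradient side in a real training loop)
    predictors:        dict mapping view type -> its own predictor weights
    """
    per_task = {}
    for view_type, embeddings in online_embeddings.items():
        predictor = predictors[view_type]  # separate predictor for this task
        losses = [-cosine(linear(predictor, z), target_embedding) for z in embeddings]
        per_task[view_type] = sum(losses) / len(losses)
    # Equal task weighting is an assumption of this sketch.
    return sum(per_task.values()), per_task


if __name__ == "__main__":
    rng = random.Random(0)
    dim = 8
    predictors = {t: make_predictor(dim, rng) for t in VIEW_TYPES}
    rand_vec = lambda: [rng.gauss(0.0, 1.0) for _ in range(dim)]
    online = {"global": [rand_vec()],
              "local": [rand_vec() for _ in range(4)],
              "masked": [rand_vec()]}
    total, per_task = multi_task_loss(online, rand_vec(), predictors)
    print(total, per_task)
```

In a shared-predictor baseline, every entry of `predictors` would point to the same weights. Keeping them separate lets each alignment task (global, local, masked) fit its own mapping to the target embedding.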
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Face recognition and analysis
