UniVIP: A Unified Framework for Self-Supervised Visual Pre-training
Zhaowen Li, Yousong Zhu, Fan Yang, Wei Li, Chaoyang Zhao, Yingying Chen, Zhiyang Chen, Jiahao Xie, Liwei Wu, Rui Zhao, Ming Tang, Jinqiao Wang

TL;DR
UniVIP is a versatile self-supervised learning framework that effectively captures scene and instance relationships, achieving state-of-the-art results across multiple visual tasks on diverse datasets.
Contribution
The paper introduces UniVIP, a unified SSL framework that models scene and instance correlations at three levels, improving transfer learning and detection performance.
Findings
Achieves state-of-the-art transfer performance on COCO and ImageNet.
Outperforms BYOL by 2.5% in linear probing.
Surpasses existing self-supervised object detection methods.
Abstract
Self-supervised learning (SSL) holds promise in leveraging large amounts of unlabeled data. However, the success of popular SSL methods has been limited to single-centric-object images like those in ImageNet, and these methods ignore the correlation between the scene and its instances, as well as the semantic differences among instances in the scene. To address these problems, we propose Unified Self-supervised Visual Pre-training (UniVIP), a novel self-supervised framework for learning versatile visual representations on either single-centric-object or non-iconic datasets. The framework considers representation learning at three levels: 1) scene-scene similarity, 2) scene-instance correlation, and 3) instance-instance discrimination. During learning, we adopt the optimal transport algorithm to automatically measure the discrimination of instances. Extensive experiments show that…
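The abstract mentions optimal transport for measuring instance discrimination but does not spell out the formulation. As a rough illustration only, the sketch below runs the generic entropy-regularized Sinkhorn-Knopp iteration on a cosine-distance cost between two sets of instance features. The function names, the uniform marginals, the cost definition, and the values of `eps` and `n_iters` are all assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def cosine_cost(x, y):
    """Pairwise cost = 1 - cosine similarity (an assumed cost, not the paper's)."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    return 1.0 - x @ y.T

def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropy-regularized optimal transport plan via Sinkhorn-Knopp scaling."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)  # uniform mass on the first set (assumption)
    b = np.full(m, 1.0 / m)  # uniform mass on the second set (assumption)
    kernel = np.exp(-cost / eps)  # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (kernel @ v)    # rescale rows toward marginal a
        v = b / (kernel.T @ u)  # rescale columns toward marginal b
    return u[:, None] * kernel * v[None, :]

# Toy data: the second set is a shuffled, slightly noisy copy of the first.
rng = np.random.default_rng(0)
feats_a = rng.normal(size=(4, 16))
perm = [2, 0, 3, 1]
feats_b = feats_a[perm] + 0.01 * rng.normal(size=(4, 16))

plan = sinkhorn_plan(cosine_cost(feats_a, feats_b))
matches = plan.argmax(axis=1)  # most likely partner for each instance in feats_a
```

In this toy setup the transport plan concentrates its mass on the pairs that came from the same underlying feature, so the plan doubles as a soft matching between instances. How UniVIP actually builds its cost and uses the resulting plan is described in the full paper, not here.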
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Advanced Neural Network Applications
Methods: Bootstrap Your Own Latent
