Time Does Tell: Self-Supervised Time-Tuning of Dense Image Representations
Mohammadreza Salehi, Efstratios Gavves, Cees G. M. Snoek, Yuki M. Asano

TL;DR
This paper introduces time-tuning, a self-supervised method that leverages temporal consistency in videos to enhance dense image representations, improving unsupervised segmentation performance on both videos and images.
Contribution
It proposes a novel temporal-alignment clustering loss for self-supervised learning, effectively transferring information from videos to improve image representations.
Findings
Improves unsupervised semantic segmentation by 8-10% on videos
Matches state-of-the-art performance on images
Leverages abundant video data for self-supervised learning
Abstract
Spatially dense self-supervised learning is a rapidly growing problem domain with promising applications for unsupervised segmentation and pretraining for dense downstream tasks. Despite the abundance of temporal data in the form of videos, this information-rich source has been largely overlooked. Our paper aims to address this gap by proposing a novel approach that incorporates temporal consistency in dense self-supervised learning. While methods designed solely for images face difficulties in achieving even the same performance on videos, our method improves the representation quality not only for videos but also for images. Our approach, which we call time-tuning, starts from image-pretrained models and fine-tunes them with a novel self-supervised temporal-alignment clustering loss on unlabeled videos. This effectively facilitates the transfer of high-level information from videos to…
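To make the idea of a temporal-consistency clustering objective more concrete, here is a minimal NumPy sketch. It is not the authors' loss: it assumes that patch i in frame t already corresponds to patch i in frame t+1 (the alignment step is skipped entirely), and the names temporal_cluster_loss, prototypes, and temperature are hypothetical. It only illustrates the general pattern of using cluster assignments from one frame as targets for the next.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def log_softmax(logits, axis=-1):
    """Numerically stable log-softmax."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def temporal_cluster_loss(feats_t, feats_t1, prototypes, temperature=0.1):
    """Cross-entropy between cluster assignments of two frames.

    feats_t, feats_t1: (num_patches, dim) patch features of frames t and t+1.
        ASSUMPTION: row i of both arrays describes the same scene content.
    prototypes: (num_clusters, dim) cluster centers (hypothetical name).
    """
    protos = l2_normalize(prototypes)
    logits_t = l2_normalize(feats_t) @ protos.T / temperature
    logits_t1 = l2_normalize(feats_t1) @ protos.T / temperature

    # Soft assignments of frame t act as fixed targets for frame t+1.
    targets = np.exp(log_softmax(logits_t))
    log_pred = log_softmax(logits_t1)
    return float(-(targets * log_pred).sum(axis=1).mean())


rng = np.random.default_rng(0)
frame_t = rng.normal(size=(16, 8))      # 16 patches, 8-dim features
frame_t1 = rng.normal(size=(16, 8))     # unrelated features for comparison
centers = rng.normal(size=(4, 8))       # 4 clusters

loss_consistent = temporal_cluster_loss(frame_t, frame_t, centers)
loss_inconsistent = temporal_cluster_loss(frame_t, frame_t1, centers)
```

In this sketch the loss is smallest when both frames yield the same assignments, which is the behaviour a temporal-consistency objective is meant to encourage. In a real training setup the targets would be excluded from gradient computation and the patch correspondence would have to be estimated rather than assumed.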
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
