Self-supervised learning of class embeddings from video
Olivia Wiles, A. Sophia Koepke, Andrew Zisserman

TL;DR
This paper presents a self-supervised method that learns class-specific image embeddings from video. The embeddings encode pose and shape, can be reused for downstream tasks, and outperform existing self-supervised methods without using labels.
Contribution
Introduces a hierarchical probabilistic decoder that handles long-range transformations between two frames of the same video. The resulting class-specific embeddings are demonstrated on three deformable object classes and outperform existing self-supervised methods.
Findings
Achieves state-of-the-art performance among self-supervised methods on three deformable object classes: human full bodies, upper bodies, and faces.
Embeddings generalize well across different domains.
Approaches supervised performance levels without using labels.
Abstract
This work explores how to use self-supervised learning on videos to learn a class-specific image embedding that encodes pose and shape information. At train time, two frames of the same video of an object class (e.g. human upper body) are extracted and each encoded to an embedding. Conditioned on these embeddings, the decoder network is tasked to transform one frame into another. To successfully perform long range transformations (e.g. a wrist lowered in one image should be mapped to the same wrist raised in another), we introduce a hierarchical probabilistic network decoder model. Once trained, the embedding can be used for a variety of downstream tasks and domains. We demonstrate our approach quantitatively on three distinct deformable object classes -- human full bodies, upper bodies, faces -- and show experimentally that the learned embeddings do indeed generalise. They achieve…
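The training setup described in the abstract (sample two frames from one video, encode each, then decode one frame into the other) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and not the authors' model: the 1-D synthetic "video", the hand-written `encode` and `decode` functions, and the L1 loss are hypothetical stand-ins, whereas the paper uses learned networks with a hierarchical probabilistic decoder.

```python
import random


def make_video(num_frames, width, rng):
    """Synthetic 1-D 'video' (hypothetical): one bright blob at a random position per frame."""
    video = []
    for _ in range(num_frames):
        pos = rng.randrange(width)
        frame = [0.0] * width
        frame[pos] = 1.0
        frame[(pos + 1) % width] = 0.5
        video.append(frame)
    return video


def sample_frame_pair(video, rng):
    """Pick two distinct frames (source, target) from the same video."""
    i, j = rng.sample(range(len(video)), 2)
    return video[i], video[j]


def encode(frame):
    """Toy stand-in for the learned encoder: the 'pose' is the index of the brightest pixel."""
    return max(range(len(frame)), key=lambda k: frame[k])


def decode(source, emb_source, emb_target):
    """Toy stand-in for the decoder: circularly shift the source by the pose difference."""
    shift = emb_target - emb_source
    n = len(source)
    return [source[(k - shift) % n] for k in range(n)]


def l1_loss(pred, target):
    """Mean absolute error between predicted and target frames (assumed loss)."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


rng = random.Random(0)
video = make_video(num_frames=8, width=16, rng=rng)
source, target = sample_frame_pair(video, rng)

# The decoder sees only the source frame and the two embeddings, never the target pixels.
prediction = decode(source, encode(source), encode(target))
loss = l1_loss(prediction, target)
print(loss)
```

In this toy the embedding is a single integer and the transformation is a pure shift, so the reconstruction is exact. It only illustrates why the target frame can be recovered from the source when the embeddings capture pose; in the real setting the embeddings are learned by minimising a reconstruction loss.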
