Contrastive Learning from Demonstrations
André Correia, Luís A. Alexandre

TL;DR
This paper introduces a contrastive self-supervised learning framework for extracting visual representations from unlabeled multi-view videos, improving robotic task imitation and reducing training time.
Contribution
It applies contrastive learning to multi-view video demonstrations for robotic tasks, enhancing representation quality and training efficiency.
Findings
Improved viewpoint alignment and stage classification accuracy.
Enhanced reinforcement learning performance.
Reduced training iterations compared to state-of-the-art methods.
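The viewpoint alignment finding can be made concrete with a small sketch. It assumes the metric is the one commonly used with time-contrastive networks: for each frame in one view, find its nearest neighbor among the embeddings of a second, synchronized view and measure how far apart the two frames are in time. The function name `alignment_error` and the equal-length, synchronized-video assumption are illustrative and are not taken from the paper.

```python
import numpy as np

def alignment_error(emb_view1, emb_view2):
    """Mean normalized temporal distance between each view-1 frame and its
    nearest neighbor among the view-2 embeddings (0.0 means perfect alignment).

    Assumption: both arrays have shape (T, D) and row t of each array
    corresponds to the same time step in two synchronized videos.
    """
    # Pairwise squared Euclidean distances, shape (T, T).
    diffs = emb_view1[:, None, :] - emb_view2[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)
    # Index of the closest view-2 frame for every view-1 frame.
    nearest = np.argmin(dists, axis=1)
    t = np.arange(emb_view1.shape[0])
    # Average time offset, normalized by the video length.
    return float(np.mean(np.abs(t - nearest)) / emb_view2.shape[0])
```

A lower value indicates that frames showing the same moment from different cameras land close together in the embedding space.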
Abstract
This paper presents a framework for learning visual representations from unlabeled video demonstrations captured from multiple viewpoints. We show that these representations are applicable for imitating several robotic tasks, including pick and place. We optimize a recently proposed self-supervised learning algorithm by applying contrastive learning to enhance task-relevant information while suppressing irrelevant information in the feature embeddings. We validate the proposed method on the publicly available Multi-View Pouring data set and a custom Pick and Place data set, and compare it with the TCN triplet baseline. We evaluate the learned representations using three metrics: viewpoint alignment, stage classification, and reinforcement learning. In all cases the results improve on state-of-the-art approaches, with the added benefit of a reduced number of training iterations.
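As a rough illustration of the idea in the abstract, the sketch below shows a generic InfoNCE-style contrastive loss over two synchronized views. It is not the paper's exact objective, which this summary does not specify. Frames taken at the same time step from different viewpoints are treated as positives and all other time steps as negatives. The function name `multiview_info_nce` and the `temperature` value are assumptions made for illustration.

```python
import numpy as np

def multiview_info_nce(z1, z2, temperature=0.1):
    """InfoNCE-style loss for two synchronized views (illustrative sketch).

    Assumption: z1 and z2 have shape (T, D); row t in each array is the
    embedding of the frame at time step t from view 1 and view 2.
    """
    # L2-normalize so the dot product is a cosine similarity.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    # Similarity of every view-1 frame to every view-2 frame, shape (T, T).
    logits = z1 @ z2.T / temperature
    # Subtract the row maximum for a numerically stable log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positive for frame t is the view-2 frame at the same time step.
    return float(-np.mean(np.diag(log_prob)))
```

Unlike a triplet loss, which compares one positive and one negative per anchor, this formulation contrasts each anchor against every other time step in the batch at once. In practice, frames adjacent in time look almost identical, so an implementation would likely need to handle near-duplicate negatives; that detail is left out here.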
Taxonomy
Topics: Advanced Vision and Imaging · Robotics and Sensor-Based Localization · Multimodal Machine Learning Applications
Methods: Contrastive Learning
