CLASS: Contrastive Learning via Action Sequence Supervision for Robot Manipulation
Sung-Wook Lee, Xuhui Kang, Brandon Yang, Yen-Ling Kuo

TL;DR
This paper introduces CLASS, a contrastive learning method that improves the generalization of robot manipulation policies, especially under visual shifts. It learns representations shared across demonstrations with similar action sequences, using weak supervision and a contrastive loss.
Contribution
The paper proposes a novel contrastive learning approach, CLASS, that leverages weak supervision from action sequences to enhance robotic manipulation generalization across visual variations.
Findings
CLASS achieves competitive results on simulation benchmarks.
Diffusion Policy with CLASS pre-training attains 75% success rate under visual shifts.
Baseline methods perform poorly under significant visual variations.
Abstract
Recent advances in Behavior Cloning (BC) have led to strong performance in robotic manipulation, driven by expressive models, sequence modeling of actions, and large-scale demonstration data. However, BC faces significant challenges when applied to heterogeneous datasets, such as visual shift with different camera poses or object appearances, where performance degrades despite the benefits of learning at scale. This stems from BC's tendency to overfit individual demonstrations rather than capture shared structure, limiting generalization. To address this, we introduce Contrastive Learning via Action Sequence Supervision (CLASS), a method for learning behavioral representations from demonstrations using supervised contrastive learning. CLASS leverages weak supervision from similar action sequences identified via Dynamic Time Warping (DTW) and optimizes a soft InfoNCE loss with…
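The two ingredients named in the abstract, DTW over action sequences and a soft InfoNCE objective, can be illustrated with a minimal sketch. Because the abstract is truncated here, the exact weighting scheme is not visible. The linear weighting in `soft_weights`, the `threshold` and `temperature` parameters, and all function names below are assumptions made for illustration, not the authors' implementation.

```python
import math

def dtw_distance(a, b):
    """Textbook DTW between two action sequences (lists of float vectors)."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = math.dist(a[i - 1], b[j - 1])  # Euclidean step cost
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def soft_weights(action_seqs, threshold):
    """Assumed weighting: 1 at DTW distance 0, falling linearly to 0 at `threshold`."""
    n = len(action_seqs)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = dtw_distance(action_seqs[i], action_seqs[j])
            w[i][j] = w[j][i] = max(0.0, 1.0 - d / threshold)
    return w

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def soft_info_nce(embeddings, weights, temperature=0.1):
    """Weighted InfoNCE: each positive's log-probability is scaled by its soft weight."""
    n = len(embeddings)
    total, count = 0.0, 0
    for i in range(n):
        w_sum = sum(weights[i])
        if w_sum == 0.0:
            continue  # anchor has no similar action sequence in this batch
        logits = {k: cosine(embeddings[i], embeddings[k]) / temperature
                  for k in range(n) if k != i}
        log_denom = math.log(sum(math.exp(v) for v in logits.values()))
        total += -sum(weights[i][j] * (logits[j] - log_denom) for j in logits) / w_sum
        count += 1
    return total / count if count else 0.0

# Toy batch: sequences 0 and 1 are the same motion at different speeds.
seqs = [[[0.0], [1.0], [2.0]],
        [[0.0], [0.0], [1.0], [2.0]],
        [[5.0], [5.0], [5.0]]]
weights = soft_weights(seqs, threshold=2.0)
```

In this toy batch, DTW treats the time-stretched pair as identical, so only that pair receives a nonzero weight. The loss is then low when the encoder places those two observations close together and high when it groups one of them with the unrelated sequence, regardless of how the observations look visually.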
Taxonomy
Topics: Human Pose and Action Recognition · Robot Manipulation and Learning · Human Motion and Animation
