CLASS: Contrastive Learning via Action Sequence Supervision for Robot Manipulation

Sung-Wook Lee; Xuhui Kang; Brandon Yang; Yen-Ling Kuo

arXiv:2508.01600·cs.RO·August 5, 2025

TL;DR

This paper introduces CLASS, a contrastive learning method that improves the generalization of robot manipulation policies, especially under visual shifts. It learns representations shared across similar action sequences, using weak supervision and a contrastive loss.

Contribution

The paper proposes CLASS, a contrastive learning approach that uses weak supervision from action sequences to improve the generalization of robotic manipulation policies across visual variations.

Findings

01. CLASS achieves competitive results on simulation benchmarks.

02. Diffusion Policy with CLASS pre-training attains a 75% success rate under visual shifts.

03. Baseline methods perform poorly under significant visual variations.

Abstract

Recent advances in Behavior Cloning (BC) have led to strong performance in robotic manipulation, driven by expressive models, sequence modeling of actions, and large-scale demonstration data. However, BC faces significant challenges when applied to heterogeneous datasets, such as visual shift with different camera poses or object appearances, where performance degrades despite the benefits of learning at scale. This stems from BC's tendency to overfit individual demonstrations rather than capture shared structure, limiting generalization. To address this, we introduce Contrastive Learning via Action Sequence Supervision (CLASS), a method for learning behavioral representations from demonstrations using supervised contrastive learning. CLASS leverages weak supervision from similar action sequences identified via Dynamic Time Warping (DTW) and optimizes a soft InfoNCE loss with…
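
The two ingredients named in the abstract, DTW-based similarity between action sequences and a soft InfoNCE objective, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract is truncated before the loss is fully specified, so the function names `dtw_distance`, `dtw_weights`, and `soft_infonce`, the exponential weighting with bandwidth `sigma`, and the `temperature` value are all assumptions made for illustration.

```python
import numpy as np


def dtw_distance(a, b):
    """Textbook O(T1*T2) Dynamic Time Warping between two action
    sequences of shape (T1, D) and (T2, D), with Euclidean step cost."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    # cost[i, j] = Euclidean distance between action a[i] and action b[j]
    cost = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]
            )
    return acc[n, m]


def dtw_weights(sequences, sigma=1.0):
    """Hypothetical soft positive weights: pairs with a smaller DTW
    distance get a larger weight. The exp(-d / sigma) form is an
    assumption, not taken from the paper."""
    n = len(sequences)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = dtw_distance(sequences[i], sequences[j])
    weights = np.exp(-dist / sigma)
    np.fill_diagonal(weights, 0.0)  # a sample is never its own positive
    return weights


def soft_infonce(z, weights, temperature=0.1):
    """Weighted InfoNCE over a batch (assumed form).
    z: (N, D) embeddings with N >= 2; weights: (N, N) non-negative
    pair weights with a zero diagonal."""
    z = np.asarray(z, dtype=float)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    logits = z @ z.T / temperature
    np.fill_diagonal(logits, -np.inf)  # exclude self-similarity
    row_max = logits.max(axis=1, keepdims=True)
    shifted = logits - row_max
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Replace the -inf diagonal so that 0 * -inf does not produce NaN.
    off_diag = ~np.eye(len(z), dtype=bool)
    log_prob = np.where(off_diag, log_prob, 0.0)
    w_sum = weights.sum(axis=1)
    valid = w_sum > 0  # anchors with at least one weighted positive
    per_anchor = -(weights * log_prob).sum(axis=1)[valid] / w_sum[valid]
    return per_anchor.mean()
```

In this sketch, `dtw_weights` would be computed once over the action sequences in a batch, and `soft_infonce` would be applied to the visual encoder's embeddings of the corresponding observations, so that observations leading to similar action sequences are pulled together regardless of camera pose or object appearance.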

Taxonomy

Topics: Human Pose and Action Recognition · Robot Manipulation and Learning · Human Motion and Animation