MOTIF: Learning Action Motifs for Few-shot Cross-Embodiment Transfer
Heng Zhi, Wentao Tan, Lei Zhu, Fengling Li, Jingjing Li, Guoli Yang, Heng Tao Shen

TL;DR
MOTIF learns embodiment-agnostic action motifs for few-shot cross-embodiment transfer, so that a policy can adapt to a new robot embodiment from only a few demonstrations.
Contribution
The paper proposes MOTIF, which learns shared action motifs from heterogeneous action data via vector quantization, using progress-aware alignment and embodiment adversarial constraints to keep the motifs consistent over time and across embodiments.
Findings
Outperforms baselines by 6.5% in simulation
Achieves 43.7% improvement in real-world transfer
Validates effectiveness in both simulation and real environments
Abstract
While vision-language-action (VLA) models have advanced generalist robotic learning, cross-embodiment transfer remains challenging due to kinematic heterogeneity and the high cost of collecting sufficient real-world demonstrations to support fine-tuning. Existing cross-embodiment policies typically rely on shared-private architectures, which suffer from limited capacity of private parameters and lack explicit adaptation mechanisms. To address these limitations, we introduce MOTIF for efficient few-shot cross-embodiment transfer that decouples embodiment-agnostic spatiotemporal patterns, termed action motifs, from heterogeneous action data. Specifically, MOTIF first learns unified motifs via vector quantization with progress-aware alignment and embodiment adversarial constraints to ensure temporal and cross-embodiment consistency. We then design a lightweight predictor that predicts…
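The vector-quantization step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it shows only the generic nearest-codebook lookup that vector quantization relies on, and it omits the progress-aware alignment and embodiment adversarial constraints entirely. The names `quantize`, `quantize_trajectory`, the toy `codebook`, and the two example trajectories are hypothetical, and the assumption that each action segment is already a fixed-length feature vector is ours.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(segment: Sequence[float],
             codebook: Sequence[Sequence[float]]) -> Tuple[int, List[float]]:
    """Return the index and value of the codebook entry nearest to `segment`.

    Hypothetical helper: each codebook entry stands in for one "motif".
    """
    best = min(range(len(codebook)),
               key=lambda i: squared_distance(segment, codebook[i]))
    return best, list(codebook[best])


def quantize_trajectory(segments: Sequence[Sequence[float]],
                        codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map every segment of a trajectory to its nearest motif index."""
    return [quantize(s, codebook)[0] for s in segments]


# Toy codebook of three 2-D "motifs" (illustrative values only).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

# Two made-up embodiments whose segment features differ numerically
# but fall closest to the same codebook entries.
arm_a = [[0.1, -0.1], [0.9, 0.2], [0.1, 0.8]]
arm_b = [[-0.2, 0.1], [1.2, -0.1], [0.2, 1.1]]

codes_a = quantize_trajectory(arm_a, codebook)
codes_b = quantize_trajectory(arm_b, codebook)
print(codes_a, codes_b)
```

In this toy setup both trajectories map to the same index sequence, which is the kind of shared discrete representation a motif codebook is meant to provide. In a learned system the codebook and the segment features would be trained jointly rather than fixed by hand.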
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Social Robot Interaction and HRI · Multimodal Machine Learning Applications
