LACE: Latent Visual Representation for Cross-Embodiment Learning
Yoo Sung Jang, Kanchana Ranasinghe, Cristina Mata, Yichi Zhang, Jorge Mendez-Mendez, Michael S. Ryoo

TL;DR
LACE is a framework that aligns human and robot visual representations in a shared latent space, enabling effective cross-embodiment learning and zero-shot transfer of robot policies using minimal demonstration data.
Contribution
LACE introduces a novel semantic alignment method leveraging shared body part correspondences and self-supervised learning to improve cross-embodiment robot learning.
Findings
Zero-shot transfer policies using LACE-DINO outperform the baseline by 65%.
LACE improves performance in low-data and out-of-distribution environments.
A single robot demonstration suffices for training the model.
Abstract
Cross-embodiment learning from human demonstrations is hindered by the visual gap between human and robot embodiments. While self-supervised learning (SSL) backbones encode rich inter-class semantics of general objects, we show they fail to establish correspondence between human and robot hands. We propose LACE, a framework that aligns human and robot visual representations in the latent space of these backbones by leveraging correspondences between shared body parts across embodiments as sparse supervision. These annotations can be obtained automatically via forward kinematics, and a single robot demonstration is sufficient to train the model. Our semantic alignment loss matches the distributions induced by corresponding features, lifting patch-level supervision to semantic-level alignment, while a Gram loss preserves pretrained feature quality. This alignment enables robot policies to…
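The abstract names two loss terms but does not give their formulas, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that "distributions induced by corresponding features" means a softmax over cosine similarities to a set of reference features, and that the Gram loss penalizes drift in pairwise patch similarities relative to the frozen backbone. All names here (`similarity_distribution`, `semantic_alignment_loss`, `gram_loss`, `anchors`, `temperature`) are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def similarity_distribution(feats, anchors, temperature=0.1):
    """Distribution each feature induces over a set of reference features.

    ASSUMPTION: the paper's abstract does not define this; a softmax over
    cosine similarities is used here purely for illustration.
    """
    sims = l2_normalize(feats) @ l2_normalize(anchors).T
    return softmax(sims / temperature, axis=-1)


def semantic_alignment_loss(human_feats, robot_feats, pairs, anchors,
                            temperature=0.1, eps=1e-12):
    """Symmetric KL between distributions of corresponding patches.

    human_feats: (N, D) patch features from a human frame.
    robot_feats: (M, D) patch features from a robot frame.
    pairs: (K, 2) integer array; row (i, j) says human patch i and robot
           patch j show the same body part (e.g. from forward kinematics).
    anchors: (A, D) hypothetical reference features.
    """
    p = similarity_distribution(human_feats[pairs[:, 0]], anchors, temperature)
    q = similarity_distribution(robot_feats[pairs[:, 1]], anchors, temperature)
    log_ratio = np.log(p + eps) - np.log(q + eps)
    kl_pq = np.sum(p * log_ratio, axis=-1)
    kl_qp = np.sum(q * -log_ratio, axis=-1)
    return float(np.mean(0.5 * (kl_pq + kl_qp)))


def gram_loss(adapted_feats, pretrained_feats):
    """Mean squared difference between patch-similarity (Gram) matrices.

    ASSUMPTION: compares the adapted features against the frozen backbone's
    features for the same image; the exact form in the paper may differ.
    """
    a = l2_normalize(adapted_feats)
    b = l2_normalize(pretrained_feats)
    return float(np.mean((a @ a.T - b @ b.T) ** 2))
```

Under these assumptions, a training objective would combine the two terms as `semantic_alignment_loss(...) + weight * gram_loss(...)`, where `weight` is a hypothetical balancing coefficient. Because the Gram term depends only on pairwise similarities, it leaves the features free to rotate in latent space while keeping their relative structure, which is one way to read "preserves pretrained feature quality".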
