Embodiment Transfer Learning for Vision-Language-Action Models
Chengmeng Li, Yaxin Peng

TL;DR
This paper introduces ET-VLA, a transfer learning framework for vision-language-action models that uses synthetic data pretraining and graph-based task modeling to improve multi-robot collaboration, validated on real robots.
Contribution
The paper proposes a novel embodiment transfer learning framework with Synthetic Continued Pretraining and a Graph-of-Thought technique for multi-robot VLA models, reducing data needs and enhancing performance.
Findings
ET-VLA outperforms OpenVLA by over 53% on real-world tasks.
Synthetic pretraining enables effective transfer without real demonstrations.
The approach improves multi-robot collaboration both in simulation and on real robots.
Abstract
Vision-language-action (VLA) models have significantly advanced robotic learning, enabling training on large-scale, cross-embodiment data and fine-tuning for specific robots. However, state-of-the-art autoregressive VLAs struggle with multi-robot collaboration. We introduce embodiment transfer learning, denoted as ET-VLA, a novel framework for efficient and effective transfer of pre-trained VLAs to multi-robot settings. ET-VLA's core is Synthetic Continued Pretraining (SCP), which uses synthetically generated data to warm up the model for the new embodiment, bypassing the need for real human demonstrations and reducing data collection costs. SCP enables the model to learn correct actions and precise action token numbers. Following SCP, the model is fine-tuned on target embodiment data. To further enhance the model performance on multi-embodiment, we present the Embodied Graph-of-Thought…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Social Robot Interaction and HRI
