DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control
Junjie Wen, Yichen Zhu, Jinming Li, Zhibin Tang, Chaomin Shen, Feifei Feng

TL;DR
DexVLA introduces a scalable diffusion-based action expert and an embodiment curriculum to enhance vision-language models for versatile, long-horizon robot tasks, enabling rapid adaptation and superior performance across diverse robot embodiments.
Contribution
The paper presents DexVLA, a novel framework with a diffusion-based action expert and curriculum learning for improved generalization and efficiency in vision-language robot control.
Findings
DexVLA outperforms state-of-the-art models on diverse robot tasks.
It enables dexterous skill learning with limited data.
The model adapts to new embodiments without task-specific training.
Abstract
Enabling robots to perform diverse tasks across varied environments is a central challenge in robot learning. While vision-language-action (VLA) models have shown promise for generalizable robot skills, realizing their full potential requires addressing limitations in action representation and efficient training. Current VLA models often focus on scaling the vision-language model (VLM) component, while the action space representation remains a critical bottleneck. This paper introduces DexVLA, a novel framework designed to enhance the efficiency and generalization capabilities of VLAs for complex, long-horizon tasks across diverse robot embodiments. DexVLA features a novel diffusion-based action expert, scaled to one billion parameters, designed for cross-embodiment learning. A novel embodiment curriculum learning strategy facilitates efficient training: (1) pre-training the diffusion…
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Advanced Neural Network Applications
Methods: Diffusion · Focus
