DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control

Junjie Wen; Yichen Zhu; Jinming Li; Zhibin Tang; Chaomin Shen; Feifei Feng

arXiv:2502.05855 · cs.RO · August 12, 2025

TL;DR

DexVLA introduces a scalable diffusion-based action expert and an embodiment curriculum to enhance vision-language models for versatile, long-horizon robot tasks, enabling rapid adaptation and superior performance across diverse robot embodiments.

Contribution

The paper presents DexVLA, a framework that pairs a vision-language model with a diffusion-based action expert, scaled to one billion parameters, and trains it with an embodiment curriculum to improve generalization and training efficiency in vision-language robot control.

Findings

01. DexVLA outperforms state-of-the-art models in diverse robot tasks.

02. It enables dexterous skill learning with limited data.

03. The model adapts to new embodiments without task-specific training.

Abstract

Enabling robots to perform diverse tasks across varied environments is a central challenge in robot learning. While vision-language-action (VLA) models have shown promise for generalizable robot skills, realizing their full potential requires addressing limitations in action representation and efficient training. Current VLA models often focus on scaling the vision-language model (VLM) component, while the action space representation remains a critical bottleneck. This paper introduces DexVLA, a novel framework designed to enhance the efficiency and generalization capabilities of VLAs for complex, long-horizon tasks across diverse robot embodiments. DexVLA features a novel diffusion-based action expert, scaled to one billion parameters, designed for cross-embodiment learning. A novel embodiment curriculum learning strategy facilitates efficient training: (1) pre-training the diffusion…

Code & Models

Repositories

juruobenruo/DexVLA
pytorch

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Advanced Neural Network Applications

Methods: Diffusion · Focus