UniVLA: Learning to Act Anywhere with Task-centric Latent Actions

Qingwen Bu; Yanting Yang; Jisong Cai; Shenyuan Gao; Guanghui Ren; Maoqing Yao; Ping Luo; Hongyang Li

arXiv:2505.06111 · cs.RO · November 4, 2025

TL;DR

UniVLA introduces a task-centric latent action framework that leverages internet-scale videos to enable cross-embodiment vision-language-action policies, achieving state-of-the-art results with less data and compute.

Contribution

The paper presents a novel latent action model derived from videos, allowing scalable, transferable robot policies across diverse embodiments and environments.

Findings

01 State-of-the-art results on multiple benchmarks
02 Efficient deployment to various robots
03 Performance improves with heterogeneous data inclusion

Abstract

A generalist robot should perform effectively across various environments. However, most existing approaches rely heavily on scaling action-annotated data to enhance their capabilities. Consequently, they are often limited to a single physical specification and struggle to learn transferable knowledge across different embodiments and environments. To confront these limitations, we propose UniVLA, a new framework for learning cross-embodiment vision-language-action (VLA) policies. Our key innovation is to derive task-centric action representations from videos with a latent action model. This enables us to exploit extensive data across a wide spectrum of embodiments and perspectives. To mitigate the effect of task-irrelevant dynamics, we incorporate language instructions and establish a latent action model within the DINO feature space. Learned from internet-scale videos, the generalist…

Code & Models

Repositories

opendrivelab/univla (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Reinforcement Learning in Robotics

Methods: Attention Is All You Need · Layer Normalization · Softmax · Linear Layer · Residual Connection · Multi-Head Attention · Dense Connections · Vision Transformer · self-DIstillation with NO labels