UniVLA: Learning to Act Anywhere with Task-centric Latent Actions
Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, Hongyang Li

TL;DR
UniVLA introduces a task-centric latent action framework that leverages internet-scale videos to enable cross-embodiment vision-language-action policies, achieving state-of-the-art results with less data and compute.
Contribution
The paper presents a novel latent action model derived from videos, allowing scalable, transferable robot policies across diverse embodiments and environments.
Findings
- State-of-the-art results on multiple benchmarks
- Efficient deployment to various robots
- Performance improves with heterogeneous data inclusion
Abstract
A generalist robot should perform effectively across various environments. However, most existing approaches heavily rely on scaling action-annotated data to enhance their capabilities. Consequently, they are often limited to a single physical specification and struggle to learn transferable knowledge across different embodiments and environments. To confront these limitations, we propose UniVLA, a new framework for learning cross-embodiment vision-language-action (VLA) policies. Our key innovation is to derive task-centric action representations from videos with a latent action model. This enables us to exploit extensive data across a wide spectrum of embodiments and perspectives. To mitigate the effect of task-irrelevant dynamics, we incorporate language instructions and establish a latent action model within the DINO feature space. Learned from internet-scale videos, the generalist…
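The idea of a latent action derived from video can be illustrated with a toy sketch. This is not the paper's model: it assumes each frame is already summarized by a feature vector (random numbers stand in for DINO features here), and it swaps in plain nearest-neighbor quantization of the feature change against a random codebook. The names `nearest_code`, `latent_action`, `DIM`, and `NUM_CODES` are hypothetical.

```python
import random


def nearest_code(delta, codebook):
    """Return the index of the codebook vector closest to `delta` (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(delta, codebook[i]))


def latent_action(feat_t, feat_next, codebook):
    """Quantize the feature change between two frames into a discrete token.

    Assumption: the 'action' is whatever explains the change between
    consecutive frame features, so no action labels are needed.
    """
    delta = [b - a for a, b in zip(feat_t, feat_next)]
    return nearest_code(delta, codebook)


# Toy data: random stand-ins for frame features and a learned codebook.
rng = random.Random(0)
DIM, NUM_CODES = 8, 4
codebook = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_CODES)]
feat_t = [rng.gauss(0, 1) for _ in range(DIM)]

# Build a next-frame feature whose change equals codebook entry 2,
# so the recovered token should point back at that entry.
feat_next = [f + c for f, c in zip(feat_t, codebook[2])]
token = latent_action(feat_t, feat_next, codebook)
print("latent action token:", token)
```

In a setup like this, a policy could be trained to predict the discrete token instead of robot-specific motor commands, which is one way to read the abstract's claim that videos without action annotations become usable training data.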
Code & Models
- 🤗 qwbu/univla-7b (model) · 444 dl · ♡ 10
- 🤗 qwbu/univla-7b-224-sft-libero (model) · ♡ 1
- 🤗 qwbu/univla-7b-bridge-pt (model) · 2 dl
- 🤗 qwbu/univla-7b-human-pt (model) · 7 dl
- 🤗 qwbu/univla-latent-action-model (model) · ♡ 4
- 🤗 qwbu/univla-7b-224-sft-calvin (model) · 12 dl · ♡ 2
- 🤗 Zhoues/RoboRefer-2B-SFT (model) · 90 dl · ♡ 8
- 🤗 Zhoues/RoboRefer-2B-Depth-Align (model) · 6 dl · ♡ 2
- 🤗 qwbu/univla-iros-manipulation-challenge-baseline (model) · 2 dl
- 🤗 Zhoues/NVILA-2B-Depth (model) · 2 dl · ♡ 2
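The UniVLA entries in the list above are Hugging Face Hub repository IDs, so they can be turned into page URLs or fetched locally. The sketch below is a minimal illustration, assuming the optional `huggingface_hub` package is installed for the download step; the helper names `repo_url` and `download` are hypothetical, and the repository layout and loading code are not described on this page.

```python
HUB_URL = "https://huggingface.co"

# Repository IDs copied from the list above (qwbu/* entries only).
UNIVLA_REPOS = [
    "qwbu/univla-7b",
    "qwbu/univla-7b-224-sft-libero",
    "qwbu/univla-7b-bridge-pt",
    "qwbu/univla-7b-human-pt",
    "qwbu/univla-latent-action-model",
    "qwbu/univla-7b-224-sft-calvin",
    "qwbu/univla-iros-manipulation-challenge-baseline",
]


def repo_url(repo_id: str) -> str:
    """Build the Hub page URL for a repository ID such as 'qwbu/univla-7b'."""
    return f"{HUB_URL}/{repo_id}"


def download(repo_id: str, local_dir: str) -> str:
    """Download all files of a Hub repository and return the local path.

    Requires `pip install huggingface_hub`; imported lazily so the rest of
    this sketch runs without it. A 7B checkpoint is a multi-gigabyte download.
    """
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id=repo_id, local_dir=local_dir)


if __name__ == "__main__":
    for repo_id in UNIVLA_REPOS:
        print(repo_url(repo_id))
```

Calling `download("qwbu/univla-latent-action-model", "./univla-lam")` would fetch that repository into a local folder; the target directory name is only an example.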
Taxonomy
Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Reinforcement Learning in Robotics
Methods: Attention Is All You Need · Layer Normalization · Softmax · Linear Layer · Residual Connection · Multi-Head Attention · Dense Connections · Vision Transformer · self-DIstillation with NO labels
