Unified Vision-Language-Action Model

Yuqi Wang; Xinghang Li; Wenxuan Wang; Junbo Zhang; Yingyan Li; Yuntao Chen; Xinlong Wang; Zhaoxiang Zhang

arXiv:2506.19850·cs.CV·June 25, 2025

TL;DR

UniVLA is a novel multimodal model that jointly learns vision, language, and actions as token sequences, capturing causal dynamics from videos to improve robotic manipulation and achieve state-of-the-art results.

Contribution

It introduces UniVLA, a unified autoregressive model that incorporates world modeling for better transfer to downstream tasks and long-horizon planning.

Findings

01. Achieves 95.5% success on the LIBERO benchmark.

02. Sets a new state of the art on CALVIN, LIBERO, and SimplerEnv-Bridge.

03. Demonstrates effectiveness on real-world manipulation and autonomous driving.

Abstract

Vision-language-action models (VLAs) have garnered significant attention for their potential in advancing robotic manipulation. However, previous approaches predominantly rely on the general comprehension capabilities of vision-language models (VLMs) to generate action signals, often overlooking the rich temporal and causal structure embedded in visual observations. In this paper, we present UniVLA, a unified and native multimodal VLA model that autoregressively models vision, language, and action signals as discrete token sequences. This formulation enables flexible multimodal task learning, particularly from large-scale video data. By incorporating world modeling during post-training, UniVLA captures causal dynamics from videos, facilitating effective transfer to downstream policy learning, especially for long-horizon tasks. Our approach sets new state-of-the-art results across…
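To make the "discrete token sequences" idea concrete, here is a minimal sketch of how text, image, and action tokens could be interleaved into one sequence for next-token prediction. It is an illustration under stated assumptions, not the authors' implementation: the vocabulary sizes, the id offsets, the boundary tokens (BOI, EOI, BOA, EOA), and the uniform action binning are all hypothetical, and the paper's actual tokenizers may differ.

```python
from typing import List, Sequence, Tuple

# Hypothetical vocabulary layout: each modality gets a disjoint id range.
TEXT_VOCAB = 1000      # assumed text vocabulary size
VISION_VOCAB = 4096    # assumed visual codebook size
ACTION_BINS = 256      # assumed number of bins per action dimension

TEXT_OFFSET = 0
VISION_OFFSET = TEXT_OFFSET + TEXT_VOCAB
ACTION_OFFSET = VISION_OFFSET + VISION_VOCAB
# Hypothetical boundary tokens marking the start/end of image and action spans.
BOI, EOI, BOA, EOA = (ACTION_OFFSET + ACTION_BINS + i for i in range(4))


def discretize_action(action: Sequence[float],
                      low: float = -1.0, high: float = 1.0) -> List[int]:
    """Uniformly bin each action dimension (illustrative stand-in tokenizer)."""
    tokens = []
    for value in action:
        clipped = min(max(value, low), high)
        bin_idx = int((clipped - low) / (high - low) * ACTION_BINS)
        tokens.append(ACTION_OFFSET + min(bin_idx, ACTION_BINS - 1))
    return tokens


def build_sequence(text_ids: Sequence[int],
                   frames: Sequence[Sequence[int]],
                   actions: Sequence[Sequence[float]]) -> List[int]:
    """Interleave instruction, per-step image codes, and per-step actions."""
    seq = [TEXT_OFFSET + t for t in text_ids]
    for frame_codes, action in zip(frames, actions):
        seq.append(BOI)
        seq.extend(VISION_OFFSET + c for c in frame_codes)
        seq.append(EOI)
        seq.append(BOA)
        seq.extend(discretize_action(action))
        seq.append(EOA)
    return seq


def next_token_pairs(seq: Sequence[int]) -> List[Tuple[int, int]]:
    """(input, target) pairs for a standard autoregressive objective."""
    return list(zip(seq[:-1], seq[1:]))


if __name__ == "__main__":
    # Two text tokens, two frames of two visual codes each, two 2-D actions.
    seq = build_sequence([1, 2], [[0, 5], [7, 9]], [[-1.0, 1.0], [0.0, 0.5]])
    print(len(seq), len(next_token_pairs(seq)))
```

Because every modality lives in the same id space, a single autoregressive model can be trained on the whole sequence. Under this layout, restricting the loss to the image spans would correspond to video-style world modeling, and restricting it to the action spans would correspond to policy learning; that mapping is an interpretation of the abstract rather than a detail confirmed by it.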

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. The paper presents a unified VLA framework that discretizes all modalities into tokens, resulting in a shared representation across visual, textual, and action modalities.
2. Comprehensive experiments are conducted in both simulated and real-world scenarios, validating the effectiveness of the proposed framework and analyzing the transferability of different visual post-training strategies.
3. The model is further evaluated beyond robotic manipulation, including autonomous driving tasks, demo

Weaknesses

1. Inference speed – The model is built upon Emu3 with 8.5B parameters and employs next-token prediction for action generation. This results in slow inference and limits its ability to handle tasks requiring high responsiveness. The authors are encouraged to provide detailed inference speed comparisons with other VLAs such as OpenVLA, π₀, and UVA.
2. While the paper presents a strong discrete-token–based autoregressive framework, it lacks systematic comparisons or discussions with continuous-act

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. This paper proposes a novel paradigm, UniVLA, that models all three modalities as discrete tokens within a shared autoregressive sequence. This represents a significant conceptual advance beyond traditional late-fusion or modality-specific architectures, enabling deeper cross-modal interaction and joint representation learning.
2. The proposed world-model post-training via video-based supervision substantially enhances temporal reasoning and data efficiency, demonstrating a clear methodologica

Weaknesses

1. Although the simulation results are impressive, the study lacks a comparison against established baseline models on real-world tasks. Robustness under noisy sensory inputs remains insufficiently validated.
2. Why does employing action prediction within this framework during post-training adversely affect model performance?
3. The 8.5B-parameter model exhibits significant latency in real-world deployment (evidenced by robotic arm stuttering in the video), critically impairing its capacity fo

Reviewer 03 · Rating 6 · Confidence 3

Strengths

- The UniVLA architecture is meaningful: it discretizes images, text, and actions into tokens and trains them under a unified autoregressive paradigm. This enables stronger cross-modal alignment via diverse proxy tasks, such as world-model training, policy learning, and multimodal understanding.
- The authors demonstrate that world-model training is an effective auxiliary task that benefits downstream policy learning.
- UniVLA achieves competitive results on several standard benchmarks, showing

Weaknesses

- Concern about inference speed. The authors use Emu3-8.5B as the autoregressive base, which is non-trivial in size for a VLA backbone. Moreover, action prediction requires generating a sequence of tokens, which inevitably slows inference. This is especially problematic in real-robot experiments, where limited on-device compute leads to a low action-prediction rate. The authors should further discuss UniVLA's deployment in simulation and on real hardware, its baseline compute requirements, and t

Code & Models

Models

🤗
Yuqi1997/UniVLA
model· ♡ 8

Taxonomy

Topics: Media, Religion, Digital Communication · Geographic Information Systems Studies · Robotics and Automated Systems