RotVLA: Rotational Latent Action for Vision-Language-Action Model

Qiwei Li; Xicheng Gong; Xinghang Li; Peiyan Li; Quanyun Zhou; Hangjun Ye; Jiahuan Zhou; Yadong Mu

arXiv:2605.13403·cs.RO·May 14, 2026

TL;DR

RotVLA introduces a continuous rotational latent action space for vision-language-action models, enhancing representational capacity and real-world applicability in robotic manipulation tasks.

Contribution

It proposes a rotational latent action representation in which latent actions are elements of SO(n), giving the latent space continuity, compositionality, and a structured geometry aligned with real-world action dynamics.

Findings

01 Achieves 98.2% on the LIBERO benchmark

02 Outperforms existing VLA models on manipulation tasks

03 Uses a 1.7B-parameter model pretrained on 1,700+ hours of data

Abstract

Latent Action Models (LAMs) have emerged as an effective paradigm for handling heterogeneous datasets during Vision-Language-Action (VLA) model pretraining, offering a unified action space across embodiments. However, existing LAMs often rely on discrete, quantization-based encode-decode pipelines, which can lead to trivial frame reconstruction behavior, limited representational capacity, and a lack of physically meaningful structure. We introduce RotVLA, a VLA framework built on a continuous rotational latent action representation. Latent actions are modeled as elements of SO(n), providing continuity, compositionality, and structured geometry aligned with real-world action dynamics. A triplet frame learning framework further enforces meaningful temporal dynamics while avoiding degeneration. RotVLA consists of a VLM backbone and a flow-matching action head, pretrained on large-scale…
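To make the SO(n) claim in the abstract concrete, the following is a minimal sketch of why rotation matrices are continuous and compositional. It is not the paper's implementation: the dimension `n = 4`, the angles, the helper names (`plane_rotation`, `matmul`, and so on), and the construction from coordinate-plane (Givens) rotations are all assumptions chosen for illustration. The abstract does not say how RotVLA parameterizes or learns its latent actions.

```python
import math

def identity(n):
    """n x n identity matrix as nested lists."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def plane_rotation(n, i, j, theta):
    """Rotation by angle theta in the (i, j) coordinate plane of R^n.

    Hypothetical helper: each such matrix is orthogonal with determinant +1,
    so it lies in SO(n), and so does any product of them.
    """
    r = identity(n)
    c, s = math.cos(theta), math.sin(theta)
    r[i][i], r[i][j] = c, -s
    r[j][i], r[j][j] = s, c
    return r

def matmul(a, b):
    """Plain matrix product of two square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transpose(a):
    return [list(row) for row in zip(*a)]

def max_abs_diff(a, b):
    """Largest entrywise difference between two matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

n = 4  # assumed latent dimension, for illustration only

# Two made-up "latent actions": frame 1 -> 2 and frame 2 -> 3.
a_12 = matmul(plane_rotation(n, 0, 1, 0.3), plane_rotation(n, 2, 3, -0.7))
a_23 = matmul(plane_rotation(n, 0, 2, 0.5), plane_rotation(n, 1, 3, 0.2))

# Compositionality: applying a_12 and then a_23 is again a rotation.
a_13 = matmul(a_23, a_12)
orthogonality_error = max_abs_diff(matmul(transpose(a_13), a_13), identity(n))

# Invertibility: the transpose undoes the action.
inverse_error = max_abs_diff(matmul(transpose(a_12), a_12), identity(n))

# Continuity in the angle: rotations in the same plane compose by adding angles.
angle_sum_error = max_abs_diff(
    matmul(plane_rotation(n, 0, 1, 0.3), plane_rotation(n, 0, 1, 0.4)),
    plane_rotation(n, 0, 1, 0.7),
)

print(orthogonality_error, inverse_error, angle_sum_error)
```

All three errors should be at the level of floating-point round-off. A discrete codebook has no comparable notion of composing or inverting two codes, which is one way to read the contrast the abstract draws with quantized latent actions.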
