RotVLA: Rotational Latent Action for Vision-Language-Action Model
Qiwei Li, Xicheng Gong, Xinghang Li, Peiyan Li, Quanyun Zhou, Hangjun Ye, Jiahuan Zhou, Yadong Mu

TL;DR
RotVLA introduces a continuous rotational latent action space for vision-language-action models, enhancing representational capacity and real-world applicability in robotic manipulation tasks.
Contribution
The paper proposes a rotational latent action representation in which latent actions are modeled as elements of SO(n), giving VLA models a structured action space that better reflects real-world dynamics.
Findings
Achieves 98.2% on the LIBERO benchmark
Outperforms existing VLA models on manipulation tasks
1.7B-parameter model pretrained on 1700+ hours of data
Abstract
Latent Action Models (LAMs) have emerged as an effective paradigm for handling heterogeneous datasets during Vision-Language-Action (VLA) model pretraining, offering a unified action space across embodiments. However, existing LAMs often rely on discrete, quantization-based encode-decode pipelines, which can lead to trivial frame reconstruction behavior, limited representational capacity, and a lack of physically meaningful structure. We introduce RotVLA, a VLA framework built on a continuous rotational latent action representation. Latent actions are modeled as elements of SO(n), providing continuity, compositionality, and structured geometry aligned with real-world action dynamics. A triplet frame learning framework further enforces meaningful temporal dynamics while avoiding degeneration. RotVLA consists of a VLM backbone and a flow-matching action head, pretrained on large-scale…
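To make the SO(n) properties named in the abstract concrete, the following is a minimal NumPy sketch, not the paper's implementation. It assumes a Cayley-transform parameterization (the source does not say how RotVLA parameterizes rotations), and the names `skew_from_params`, `cayley_rotation`, `is_rotation`, and the latent dimension `n = 4` are hypothetical, chosen only for illustration. It shows that rotations built this way are continuous in their parameters, compose by matrix multiplication across three consecutive frames, and are inverted by a transpose.

```python
import numpy as np


def skew_from_params(params, n):
    """Pack n*(n-1)/2 free parameters into an n x n skew-symmetric matrix."""
    A = np.zeros((n, n))
    A[np.triu_indices(n, k=1)] = params
    return A - A.T


def cayley_rotation(params, n):
    """Map unconstrained parameters to an element of SO(n).

    Assumption: Cayley transform R = (I - A)^{-1} (I + A) with A skew-symmetric.
    I - A is always invertible for real skew-symmetric A, and det(R) = +1.
    """
    A = skew_from_params(params, n)
    I = np.eye(n)
    return np.linalg.solve(I - A, I + A)


def is_rotation(R, tol=1e-8):
    """Check orthogonality and unit determinant."""
    n = R.shape[0]
    return bool(
        np.allclose(R.T @ R, np.eye(n), atol=tol)
        and np.isclose(np.linalg.det(R), 1.0, atol=tol)
    )


n = 4                      # hypothetical latent dimension
k = n * (n - 1) // 2       # degrees of freedom of SO(n)
rng = np.random.default_rng(0)

# Hypothetical latent actions between frames a -> b and b -> c.
R_ab = cayley_rotation(0.1 * rng.normal(size=k), n)
R_bc = cayley_rotation(0.1 * rng.normal(size=k), n)

# Compositionality: the a -> c action is the product of the two steps.
R_ac = R_bc @ R_ab

# Apply the actions to a hypothetical latent state of frame a.
z_a = rng.normal(size=n)
z_b = R_ab @ z_a
z_c = R_bc @ z_b
```

Because every element of SO(n) is orthogonal, the inverse action is `R_ab.T`, and applying an action preserves the norm of the latent state; a discrete codebook of action tokens has no comparable built-in composition or inverse.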
