OmniVLA-RL: A Vision-Language-Action Model with Spatial Understanding and Online RL

Haoxiang Jie; Yaoyuan Yan; Xiangyu Wei; Kailin Wang; Hongjie Yan; Zhiyou Heng; Daocheng Chen

arXiv:2604.17706·cs.RO·April 27, 2026

TL;DR

OmniVLA-RL is a novel vision-language-action model that improves spatial understanding and reinforcement learning stability through a Mix-of-Transformers architecture and a flow matching reformulation.

Contribution

It introduces OmniVLA-RL with a Mix-of-Transformers design and Flow-GSPO, addressing spatial perception and training stability issues in VLA models.

Findings

01. Outperforms existing methods on LIBERO benchmarks

02. Enhances action precision and training robustness

03. Achieves decent overall performance

Abstract

Vision-Language-Action (VLA) models represent a paradigm shift in embodied AI, yet existing frameworks often struggle with imprecise spatial perception, suboptimal multimodal fusion, and instability in reinforcement learning. To bridge these gaps, we propose OmniVLA-RL, a novel architecture that leverages a Mix-of-Transformers (MoT) design to synergistically integrate reasoning, spatial, and action experts. Furthermore, we introduce Flow-GSPO, which reformulates flow matching as a Stochastic Differential Equation (SDE) process and integrates it with Group Segmented Policy Optimization (GSPO) to enhance action precision and training robustness. Extensive evaluations on the LIBERO and LIBERO-Plus benchmarks demonstrate that OmniVLA-RL achieves decent overall performance and surpasses mainstream existing methods, effectively overcoming the fundamental limitations of current VLA models.
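
The abstract does not spell out how Flow-GSPO is formulated, so the following is only a minimal sketch of the general idea it names, under stated assumptions. A deterministic flow-matching sampler has no per-step likelihood, whereas a stochastic (Euler-Maruyama) step has a Gaussian transition whose log-probability can be evaluated, which is what a policy-gradient ratio needs. All function names, the fixed noise scale `sigma`, the length-normalised rollout-level ratio, and the clipping range `eps` are hypothetical choices for illustration and are not taken from the paper. A faithful ODE-to-SDE conversion would generally also need a drift correction to preserve the flow's marginals; that is omitted here.

```python
import math
import random


def sde_step(x, v, dt, sigma, rng):
    """One Euler-Maruyama step: x' = x + v*dt + sigma*sqrt(dt)*eps.

    x, v: lists of floats (state and predicted velocity; hypothetical inputs).
    Returns the sampled next state plus the Gaussian mean and std of the step.
    """
    mean = [xi + vi * dt for xi, vi in zip(x, v)]
    std = sigma * math.sqrt(dt)
    nxt = [m + std * rng.gauss(0.0, 1.0) for m in mean]
    return nxt, mean, std


def gaussian_logprob(x, mean, std):
    """Log-density of x under an isotropic Gaussian N(mean, std^2 I)."""
    const = math.log(std) + 0.5 * math.log(2.0 * math.pi)
    return sum(-0.5 * ((xi - mi) / std) ** 2 - const for xi, mi in zip(x, mean))


def group_advantages(rewards):
    """Normalise rewards within one group of rollouts (zero mean, unit std)."""
    mu = sum(rewards) / len(rewards)
    var = sum((r - mu) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) + 1e-8
    return [(r - mu) / std for r in rewards]


def clipped_group_objective(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate using one importance ratio per rollout.

    logp_new, logp_old: per-rollout lists of per-step log-probs.
    The ratio is length-normalised (an assumption made for this sketch).
    """
    total = 0.0
    for new, old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp((sum(new) - sum(old)) / len(new))
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

In use, each denoising step of a rollout would call `sde_step`, score the sampled state with `gaussian_logprob` under both the old and the updated policy, and pass the collected per-step log-probabilities to `clipped_group_objective` together with the output of `group_advantages`.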
