OmniVLA-RL: A Vision-Language-Action Model with Spatial Understanding and Online RL
Haoxiang Jie, Yaoyuan Yan, Xiangyu Wei, Kailin Wang, Hongjie Yan, Zhiyou Heng, Daocheng Chen

TL;DR
OmniVLA-RL is a novel vision-language-action model that improves spatial understanding and reinforcement learning stability through a Mix-of-Transformers architecture and a flow matching reformulation.
Contribution
Introduces OmniVLA-RL, which combines a Mix-of-Transformers design with Flow-GSPO to address the spatial perception and training stability issues of VLA models.
Findings
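Outperforms mainstream existing methods on the LIBERO and LIBERO-Plus benchmarks
Enhances action precision and training robustness through Flow-GSPO
Integrates reasoning, spatial, and action experts in a single Mix-of-Transformers model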
Abstract
Vision-Language-Action (VLA) models represent a paradigm shift in embodied AI, yet existing frameworks often struggle with imprecise spatial perception, suboptimal multimodal fusion, and instability in reinforcement learning. To bridge these gaps, we propose OmniVLA-RL, a novel architecture that leverages a Mix-of-Transformers (MoT) design to synergistically integrate reasoning, spatial, and action experts. Furthermore, we introduce Flow-GSPO, which reformulates flow matching as a Stochastic Differential Equation (SDE) process and integrates it with Group Segmented Policy Optimization (GSPO) to enhance action precision and training robustness. Extensive evaluations on the LIBERO and LIBERO-Plus benchmarks demonstrate that OmniVLA-RL achieves decent overall performance and surpasses existing mainstream methods, effectively overcoming the fundamental limitations of current VLA models.
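The abstract names two ingredients of Flow-GSPO without giving formulas: sampling actions through an SDE instead of a deterministic flow, and a group-based policy optimization objective. The sketch below is only an illustration of how such pieces commonly fit together, not the authors' implementation. It assumes a plain Euler-Maruyama step with a constant noise scale `sigma`, and a group-normalized, length-normalized clipped ratio. The function names, `sigma`, `eps`, and `n_steps` are all hypothetical, and the paper's actual noise schedule, drift term, and GSPO objective may differ.

```python
import math
import random

def sde_step(x, velocity, dt, sigma, rng):
    """One Euler-Maruyama step: drift from the velocity field plus Gaussian noise.
    Assumption: constant noise scale sigma (hypothetical, not from the paper)."""
    std = sigma * math.sqrt(dt)
    mean = [xi + vi * dt for xi, vi in zip(x, velocity)]
    x_next = [m + std * rng.gauss(0.0, 1.0) for m in mean]
    return x_next, mean, std

def gaussian_log_prob(x_next, mean, std):
    """Log-density of x_next under the isotropic Gaussian transition N(mean, std^2 I).
    The noise is what makes each denoising step a tractable stochastic policy."""
    sq = sum(((a - m) / std) ** 2 for a, m in zip(x_next, mean))
    return -0.5 * sq - len(mean) * math.log(std * math.sqrt(2.0 * math.pi))

def group_clipped_objective(logp_new, logp_old, rewards, n_steps, eps=0.2):
    """Clipped surrogate over a group of rollouts (to be maximized).
    Assumption: advantages are rewards normalized within the group, and the
    importance ratio is taken per trajectory and normalized by n_steps."""
    mu = sum(rewards) / len(rewards)
    sd = math.sqrt(sum((r - mu) ** 2 for r in rewards) / len(rewards))
    total = 0.0
    for ln, lo, r in zip(logp_new, logp_old, rewards):
        adv = (r - mu) / (sd + 1e-8)
        ratio = math.exp((ln - lo) / n_steps)
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        total += min(ratio * adv, clipped * adv)
    return total / len(rewards)
```

In this sketch, the per-step log-probabilities from `gaussian_log_prob` would be summed along a denoising trajectory to give the `logp_new` and `logp_old` entries, one per rollout in the group.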
