RoVLA: Multi-Consistency Constraints for Robust Vision-Language-Action Models

Jingzhou Luo; Yifan Wen; Yongjie Bai; Xinshuai Song; Yang Liu; Liang Lin

arXiv:2605.19678·cs.RO·May 20, 2026

TL;DR

RoVLA is a multi-consistency constrained framework for vision-language-action models. It improves robustness and generalization by enforcing invariance across instruction semantics, trajectory evolution, and observations during training.

Contribution

RoVLA proposes multi-consistency constraints that explicitly model invariances within the end-to-end policy, improving robustness and generalization in embodied manipulation tasks.

Findings

01 Outperforms baseline methods on LIBERO-Plus, RoboTwin 2.0, and real-world tasks.

02 Demonstrates increased robustness under diverse task and observation shifts.

03 Shows that multi-consistency learning reduces reliance on superficial correlations.

Abstract

Vision-Language-Action (VLA) models have shown strong performance on embodied manipulation, yet they remain brittle under visual observation changes, paraphrased language instructions, and compounded perturbations. This limitation suggests that existing methods still rely heavily on shallow correlations in the training distribution, rather than learning stable couplings among task semantics, environment states, and action generation. Although recent efforts improve robustness through larger-scale training, post-training adaptation, or enhanced predictive modeling, they rarely enforce invariance-oriented consistency within the end-to-end policy itself. To address this issue, we propose RoVLA, a robust vision-language-action framework with multi-consistency constraints. RoVLA enforces consistency under three complementary transformations: instruction semantics, trajectory evolution, and…
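To make the idea of an invariance-oriented consistency constraint concrete, here is a minimal sketch. It is not the RoVLA implementation: the function names, the squared-error penalty, the weighting term, and the toy policy are all assumptions made for illustration. It covers only a paraphrased-instruction term and a perturbed-observation term, and leaves out the trajectory-evolution term because the text above does not describe how that one is computed.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]
# Hypothetical policy signature: (observation features, instruction embedding) -> action
Policy = Callable[[Vector, Vector], List[float]]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def consistency_loss(policy: Policy, obs: Vector, instr: Vector,
                     obs_aug: Vector, instr_para: Vector) -> float:
    """Penalize action changes under a paraphrased instruction and a perturbed observation."""
    base = policy(obs, instr)
    semantic_term = mse(base, policy(obs, instr_para))    # same task, reworded instruction
    observation_term = mse(base, policy(obs_aug, instr))  # same state, shifted observation
    return semantic_term + observation_term


def total_loss(policy: Policy, obs: Vector, instr: Vector,
               obs_aug: Vector, instr_para: Vector,
               expert_action: Vector, weight: float = 0.5) -> float:
    """Imitation loss plus a weighted consistency penalty (the weight is an assumed hyperparameter)."""
    imitation = mse(policy(obs, instr), expert_action)
    return imitation + weight * consistency_loss(policy, obs, instr, obs_aug, instr_para)


def toy_policy(obs: Vector, instr: Vector) -> List[float]:
    """Stand-in policy used only to exercise the loss; a real VLA model would go here."""
    return [o + i for o, i in zip(obs, instr)]
```

A policy that outputs the same action for the original and transformed inputs incurs no consistency penalty, so under this sketch the extra term only activates when the policy is sensitive to the perturbation.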

Code & Models

Repositories

HCPLab-SYSU/RoVLA
github
