TL;DR
RoVLA is a vision-language-action framework trained with multi-consistency constraints. It improves robustness and generalization by enforcing invariance across instruction semantics, trajectory evolution, and observations.
Contribution
The paper proposes multi-consistency constraints that explicitly model invariances inside the end-to-end policy, improving robustness and generalization in embodied manipulation tasks.
Findings
Outperforms baseline methods on LIBERO-Plus, RoboTwin 2.0, and real-world tasks.
Demonstrates increased robustness under diverse task and observation shifts.
Shows that multi-consistency learning reduces reliance on superficial correlations.
Abstract
Vision-Language-Action (VLA) models have shown strong performance on embodied manipulation, yet they remain brittle under visual observation changes, paraphrased language instructions, and compounded perturbations. This limitation suggests that existing methods still rely heavily on shallow correlations in the training distribution, rather than learning stable couplings among task semantics, environment states, and action generation. Although recent efforts improve robustness through larger-scale training, post-training adaptation, or enhanced predictive modeling, they rarely enforce invariance-oriented consistency within the end-to-end policy itself. To address this issue, we propose RoVLA, a robust vision-language-action framework with multi-consistency constraints. RoVLA enforces consistency under three complementary transformations: instruction semantics, trajectory evolution, and…
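The abstract is truncated before it states RoVLA's actual objectives, so the following is only a minimal sketch of the general idea of an invariance-oriented consistency term, assuming a simple form: an imitation loss plus a penalty on how much the predicted action changes when the instruction is paraphrased. It is not the paper's loss. Every name here (`consistency_regularized_loss`, `invariant_policy`, `brittle_policy`, the `weight` parameter) is hypothetical, and the toy policies stand in for a real VLA model.

```python
from typing import Callable, List, Sequence

# Hypothetical policy signature: (observation features, instruction) -> action vector.
Policy = Callable[[Sequence[float], str], List[float]]


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def consistency_regularized_loss(
    policy: Policy,
    observation: Sequence[float],
    instruction: str,
    paraphrase: str,
    expert_action: Sequence[float],
    weight: float = 0.5,  # assumed trade-off coefficient, not from the paper
) -> float:
    """Imitation loss plus a penalty for disagreeing across paraphrases."""
    pred = policy(observation, instruction)
    pred_paraphrased = policy(observation, paraphrase)
    imitation = mse(pred, expert_action)
    # Assumed consistency term: same observation, reworded instruction,
    # so the predicted action should not change.
    consistency = mse(pred, pred_paraphrased)
    return imitation + weight * consistency


def invariant_policy(obs: Sequence[float], instruction: str) -> List[float]:
    """Toy policy that ignores wording entirely."""
    return list(obs)


def brittle_policy(obs: Sequence[float], instruction: str) -> List[float]:
    """Toy policy that latches onto a superficial cue (word count)."""
    scale = len(instruction.split())
    return [o * scale for o in obs]


obs = [0.1, 0.2]
expert = [0.1, 0.2]
args = (obs, "pick up the red block", "grab the red cube", expert)

loss_invariant = consistency_regularized_loss(invariant_policy, *args)
loss_brittle = consistency_regularized_loss(brittle_policy, *args)
print(loss_invariant, loss_brittle)
```

In this toy setup the wording-sensitive policy is penalized twice, once for missing the expert action and once for changing its output under a paraphrase, which is the kind of pressure against shallow correlations the abstract describes. Analogous terms for trajectory evolution or observation changes would need definitions that the truncated abstract does not provide.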
