EvoVLA: Self-Evolving Vision-Language-Action Model
Zeting Liu, Zida Yang, Zeyu Zhang, Hao Tang

TL;DR
EvoVLA is a self-supervised framework for long-horizon robotic manipulation that reduces stage hallucination, improves success rates, and enhances sim-to-real transfer through three components: Stage-Aligned Reward (SAR), Pose-Based Object Exploration (POE), and Long-Horizon Memory.
Contribution
EvoVLA presents a new self-supervised VLA model with three key components to address stage hallucination and improve long-horizon manipulation performance.
Findings
Achieves a 10.2% higher success rate than the baseline on Discoverse-L.
Reduces stage hallucination from 38.5% to 14.8%.
Outperforms the baseline in real-world robot experiments with 54.6% success.
Abstract
Long-horizon robotic manipulation remains challenging for Vision-Language-Action (VLA) models despite recent progress in zero-shot generalization and simulation-to-real-world transfer. Current VLA models suffer from stage hallucination, where agents exploit coarse evaluation signals to shortcut multi-step tasks, reporting high progress without truly completing them. We present EvoVLA, a self-supervised VLA framework that addresses this issue through three complementary components: Stage-Aligned Reward (SAR), which uses triplet contrastive learning with Gemini-generated hard negatives to prevent visual shortcuts; Pose-Based Object Exploration (POE), which grounds curiosity in relative object-gripper pose instead of raw pixels; and Long-Horizon Memory, which uses selective context retention and gated fusion to stabilize intrinsic shaping during extended rollouts. Extensive evaluations on…
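The abstract names triplet contrastive learning with hard negatives as the basis of SAR but gives no formula here. As a rough illustration of the general technique only, the sketch below computes a standard triplet margin loss over embedding vectors using cosine distance. The function names, the choice of cosine distance, and the margin value of 0.2 are all assumptions made for this example; they are not taken from the paper and may differ from what EvoVLA actually does.

```python
import numpy as np


def cosine_distance(a, b):
    """Row-wise cosine distance between two (batch, dim) arrays."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - np.sum(a * b, axis=1)


def triplet_margin_loss(anchor, positive, negative, margin=0.2):
    """Generic triplet margin loss (hypothetical helper, not the paper's code).

    anchor:   embeddings of the reference, e.g. a stage description
    positive: embeddings that truly match the anchor
    negative: embeddings of hard negatives that resemble the positive
              but do not match the anchor
    margin:   assumed value; how much closer the positive must be
    """
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    # Penalize only triplets where the negative is not at least
    # `margin` farther from the anchor than the positive.
    return float(np.maximum(0.0, d_pos - d_neg + margin).mean())


if __name__ == "__main__":
    anchor = np.array([[1.0, 0.0]])
    matching = np.array([[1.0, 0.0]])
    mismatched = np.array([[0.0, 1.0]])
    print(triplet_margin_loss(anchor, matching, mismatched))
    print(triplet_margin_loss(anchor, mismatched, matching))
```

In this toy case the loss is zero when the positive coincides with the anchor and the negative is orthogonal to it, and it becomes positive when the two are swapped. Hard negatives matter because easy negatives already sit far from the anchor and contribute no gradient.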
Taxonomy
Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Advanced Neural Network Applications
