EVOLVE-VLA: Test-Time Training from Environment Feedback for Vision-Language-Action Models
Zechen Bai, Chen Gao, Mike Zheng Shou

TL;DR
EVOLVE-VLA is a test-time training framework that lets vision-language-action (VLA) models keep adapting from environment feedback during deployment, improving performance and generalization with minimal or zero task-specific demonstrations.
Contribution
The paper presents EVOLVE-VLA, a novel framework that replaces oracle rewards with learned feedback, allowing VLAs to adapt during deployment with minimal supervision.
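As a rough illustration of swapping an oracle reward for learned feedback, the sketch below adapts a toy softmax policy over three discrete actions by gradient ascent on the expected score from a stand-in feedback model. Everything here is an assumption made for illustration: `LEARNED_SCORES`, `adapt_policy`, the three-action setup, and the use of an exact expected-score gradient are hypothetical and are not the paper's estimator, policy, or update rule.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical per-action scores from a learned feedback model.
# They stand in for the oracle success signal that is unavailable at test time.
LEARNED_SCORES = [0.1, 0.3, 0.9]


def adapt_policy(logits, scores, steps=200, lr=1.0):
    """Gradient ascent on the expected learned score of a softmax policy.

    For a softmax policy, d E[score] / d logit_j = p_j * (score_j - E[score]).
    """
    logits = list(logits)
    for _ in range(steps):
        probs = softmax(logits)
        expected = sum(p * s for p, s in zip(probs, scores))
        logits = [
            l + lr * p * (s - expected)
            for l, p, s in zip(logits, probs, scores)
        ]
    return logits


# Start from a uniform policy and adapt using only the learned scores.
adapted_probs = softmax(adapt_policy([0.0, 0.0, 0.0], LEARNED_SCORES))
print(adapted_probs)
```

The policy never sees a ground-truth success label, so its final behavior is only as good as the feedback model's scores. A real system would estimate this gradient from sampled rollouts rather than compute it exactly.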
Findings
+8.6% on long-horizon tasks
+22.0% in 1-shot learning
20.8% success on unseen tasks without task-specific demonstrations
Abstract
Achieving truly adaptive embodied intelligence requires agents that learn not just by imitating static demonstrations, but by continuously improving through environmental interaction, which is akin to how humans master skills through practice. Vision-Language-Action (VLA) models have advanced robotic manipulation by leveraging large language models, yet remain fundamentally limited by Supervised Finetuning (SFT): requiring hundreds of demonstrations per task, rigidly memorizing trajectories, and failing to adapt when deployment conditions deviate from training. We introduce EVOLVE-VLA, a test-time training framework enabling VLAs to continuously adapt through environment interaction with minimal or zero task-specific demonstrations. The key technical challenge is replacing oracle reward signals (unavailable at test time) with autonomous feedback. We address this through a learned…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Reinforcement Learning in Robotics
