EvoScene-VLA: Evolving Scene Beliefs Inside the Action Decoder for Chunked Robot Control
Chushan Zhang, Ruihan Lu, Jinguang Tong, Xuesong Li, Yikai Wang, Hongdong Li

TL;DR
EvoScene-VLA adds a persistent, action-updated scene state to chunked robot control policies. Each control call combines the current visual observation with a scene prior updated by the preceding action chunk, which improves success on multi-step robot tasks.
Contribution
It proposes a recurrent scene representation that maintains geometry-aware scene priors across control chunks, improving robot control performance.
Findings
Increases average success rate from 87.2% to 89.1% on RoboTwin tasks.
Outperforms all baselines on the Galaxea R1-Lite robot.
Effectively integrates scene updates with visual observations during control.
Abstract
Chunked vision-language-action (VLA) policies predict multi-step robot controls, conditioning each update on the current visual observation alone. Yet robot actions cause contact, occlusion, and object motion, and the geometry that later decisions depend on can change before the next visual update arrives. Spatial VLAs improve current-frame geometry. Temporal VLAs aggregate past frames. Neither maintains an action-updated scene prior across chunks. We argue for a persistent action-updated scene state across control calls, and introduce EvoScene-VLA. Its recurrent scene prefix carries a geometry-aware scene state across chunks. At each vision-language model (VLM) call, the VLM combines scene information from the current observation with the action-updated prior from the previous chunk; the action decoder outputs both the next action chunk and a compact scene update. This update becomes…
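The control loop described in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `run_chunked_control`, `ChunkOutput`, and the `observe`, `fuse`, `decode`, and `execute` callables are hypothetical names, plain lists of floats stand in for learned features, and the sketch assumes that the scene update emitted with one chunk is what the next call reads as its prior.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]


@dataclass
class ChunkOutput:
    """Hypothetical decoder output: an action chunk plus a compact scene update."""
    actions: List[Vector]
    scene_update: Vector


def run_chunked_control(
    observe: Callable[[], Vector],
    fuse: Callable[[Vector, Vector], Vector],
    decode: Callable[[Vector], ChunkOutput],
    execute: Callable[[Vector], None],
    initial_prior: Vector,
    num_chunks: int,
) -> Vector:
    """Run `num_chunks` control calls, carrying a scene prior between them."""
    prior = initial_prior
    for _ in range(num_chunks):
        # Stand-in for the VLM call: combine the current observation
        # with the prior carried over from the previous chunk.
        observation = observe()
        context = fuse(observation, prior)

        # Stand-in for the action decoder: it returns both the next
        # action chunk and a compact scene update.
        output = decode(context)

        # Execute the whole chunk before the next observation is taken.
        for action in output.actions:
            execute(action)

        # Assumption: the scene update is the prior for the next call.
        prior = output.scene_update
    return prior
```

The point of the sketch is the data flow: the only state that survives from one call to the next is `prior`, so whatever the executed actions changed in the scene has to reach the next call through `scene_update` rather than through a stored history of frames.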
