From Noise to Intent: Anchoring Generative VLA Policies with Residual Bridges
Yiming Zhong, Yaoyu He, Zemin Yang, Pengfei Tian, Yifan Huang, Qingqiu Huang, Xinge Zhu, and Yuexin Ma

TL;DR
ResVLA replaces generation from noise with refinement from a predicted intent: it decouples global intent from local dynamics, which the authors report improves efficiency, robustness, and convergence in embodied robotic control tasks.
Contribution
The paper proposes ResVLA, an architecture that uses spectral analysis to split control into a deterministic low-frequency anchor and a stochastic high-frequency residual, shifting generative policies from "Generation-from-Noise" to "Refinement-from-Intent."
Findings
ResVLA achieves competitive performance in simulation.
It demonstrates robustness to language and embodiment perturbations.
ResVLA converges faster than standard generative models.
Abstract
Bridging high-level semantic understanding with low-level physical control remains a persistent challenge in embodied intelligence, stemming from the fundamental spatiotemporal scale mismatch between cognition and action. Existing generative VLA policies typically adopt a "Generation-from-Noise" paradigm, which disregards this disparity, leading to representation inefficiency and weak condition alignment during optimization. In this work, we propose ResVLA, an architecture that shifts the paradigm to "Refinement-from-Intent." Recognizing that robotic motion naturally decomposes into global intent and local dynamics, ResVLA utilizes spectral analysis to decouple control into a deterministic low-frequency anchor and a stochastic high-frequency residual. By anchoring the generative process on the predicted intent, our model focuses strictly on refining local dynamics via a residual…
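The low-frequency/high-frequency split described in the abstract can be illustrated with a small sketch. The abstract excerpt does not say which transform or cutoff ResVLA uses, so everything here is an assumption for illustration: a real FFT along the time axis, a hard cutoff, and the hypothetical names `split_anchor_residual` and `keep`. It shows only the decomposition of an action chunk, not the model or its residual bridge.

```python
import numpy as np


def split_anchor_residual(actions: np.ndarray, keep: int):
    """Split an action chunk of shape (T, D) into anchor + residual.

    Assumption: the anchor is the part of the signal carried by the `keep`
    lowest rFFT frequency bins; `keep` is a hypothetical parameter, not a
    value taken from the paper.
    """
    spectrum = np.fft.rfft(actions, axis=0)       # per-dimension spectrum over time
    low = np.zeros_like(spectrum)
    low[:keep] = spectrum[:keep]                  # keep only the lowest bins
    anchor = np.fft.irfft(low, n=actions.shape[0], axis=0)
    residual = actions - anchor                   # everything the anchor misses
    return anchor, residual


# Toy trajectory: a slow sweep (1 cycle) plus a small fast wiggle (12 cycles).
T = 64
t = np.linspace(0.0, 1.0, T, endpoint=False)
slow = np.sin(2 * np.pi * t)
fast = 0.1 * np.sin(2 * np.pi * 12 * t)
actions = np.stack([slow + fast, np.cos(2 * np.pi * t)], axis=1)

anchor, residual = split_anchor_residual(actions, keep=4)
print("max |anchor - slow|  :", np.abs(anchor[:, 0] - slow).max())
print("max |residual - fast|:", np.abs(residual[:, 0] - fast).max())
```

In this toy case the anchor recovers the slow sweep and the residual recovers the fast wiggle, and the two always sum back to the original chunk. In the paper's framing, a generative model would only need to model the residual around a predicted anchor rather than the whole trajectory from noise.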
