GEVRM: Goal-Expressive Video Generation Model For Robust Visual Manipulation
Hongyin Zhang, Pengxiang Ding, Shangke Lyu, Ying Peng, Donglin Wang

TL;DR
GEVRM is a goal-expressive video generation model that makes robot vision-language-action (VLA) systems more robust by integrating the internal model control (IMC) principle and evaluating external perturbations.
Contribution
It introduces a closed-loop VLA framework with a text-guided video generator and perturbation inference, improving robustness against external disturbances.
Findings
Achieves state-of-the-art results on CALVIN benchmarks.
Significantly improves performance in realistic robot tasks.
Effectively distinguishes external perturbations through internal embeddings.
Abstract
With the rapid development of embodied artificial intelligence, significant progress has been made in vision-language-action (VLA) models for general robot decision-making. However, the majority of existing VLAs fail to account for the inevitable external perturbations encountered during deployment. These perturbations introduce unforeseen state information to the VLA, resulting in inaccurate actions and, consequently, a significant decline in generalization performance. The classic internal model control (IMC) principle demonstrates that a closed-loop system with an internal model that includes external input signals can accurately track the reference input and effectively offset the disturbance. We propose a novel closed-loop VLA method, GEVRM, that integrates the IMC principle to enhance the robustness of robot visual manipulation. The text-guided video generation model in GEVRM can…
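The IMC principle cited in the abstract can be illustrated with a toy example. The sketch below is not GEVRM's method: it assumes a hypothetical scalar first-order plant with a constant disturbance, and every name and parameter value (`simulate`, `a`, `b`, `d`, `r`) is chosen purely for illustration. It compares a controller that ignores the disturbance with one that runs an internal model in parallel and feeds the plant-model mismatch back into the reference.

```python
def simulate(use_imc, steps=500, a=0.9, b=0.5, d=0.3, r=1.0):
    """Simulate y[k+1] = a*y[k] + b*u[k] + d and return the final output.

    All parameters are illustrative assumptions, not values from the paper.
    """
    y = 0.0          # true plant output, affected by the disturbance d
    y_model = 0.0    # internal model output, which never sees d
    gain = (1.0 - a) / b  # inverse of the model's steady-state gain

    for _ in range(steps):
        # With IMC, the plant-model mismatch serves as a disturbance estimate
        # and is subtracted from the reference before computing the control.
        mismatch = (y - y_model) if use_imc else 0.0
        u = gain * (r - mismatch)

        y = a * y + b * u + d              # real plant step
        y_model = a * y_model + b * u      # internal model step

    return y


open_loop = simulate(use_imc=False)
closed_loop = simulate(use_imc=True)
print(f"without internal-model feedback: {open_loop:.3f}")
print(f"with internal-model feedback:    {closed_loop:.3f}")
```

Under these assumptions the run without feedback settles at r + d / (1 - a), well away from the reference, while the run with internal-model feedback settles at r, because the mismatch signal converges to the disturbance's steady-state effect and cancels it.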
Taxonomy
Topics: Advanced Vision and Imaging · Visual Attention and Saliency Detection · Reinforcement Learning in Robotics
