GEVRM: Goal-Expressive Video Generation Model For Robust Visual Manipulation

Hongyin Zhang; Pengxiang Ding; Shangke Lyu; Ying Peng; Donglin Wang

arXiv:2502.09268 · cs.RO · February 17, 2025

TL;DR

GEVRM is a novel goal-expressive video generation model that enhances the robustness of vision-language-action systems in robots by integrating internal model control principles and perturbation evaluation.

Contribution

It introduces a closed-loop VLA framework with a text-guided video generator and perturbation inference, improving robustness against external disturbances.

Findings

01 Achieves state-of-the-art results on CALVIN benchmarks.

02 Significantly improves performance in realistic robot tasks.

03 Effectively distinguishes external perturbations through internal embeddings.

Abstract

With the rapid development of embodied artificial intelligence, significant progress has been made in vision-language-action (VLA) models for general robot decision-making. However, the majority of existing VLAs fail to account for the inevitable external perturbations encountered during deployment. These perturbations introduce unforeseen state information to the VLA, resulting in inaccurate actions and, consequently, a significant decline in generalization performance. The classic internal model control (IMC) principle demonstrates that a closed-loop system with an internal model that includes external input signals can accurately track the reference input and effectively offset the disturbance. We propose a novel closed-loop VLA method, GEVRM, that integrates the IMC principle to enhance the robustness of robot visual manipulation. The text-guided video generation model in GEVRM can…
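
The IMC principle cited in the abstract can be illustrated on a toy problem. The sketch below is not GEVRM's method and involves no video generation or learned embeddings. It assumes a hypothetical scalar plant `y = gain * u + d` and an internal model that matches the plant exactly. The function name `simulate` and all parameter values are invented for illustration. The controller inverts the internal model, and the mismatch between the measured output and the model's prediction is fed back as a disturbance estimate.

```python
def simulate(steps=10, reference=1.0, gain=2.0, disturbance=0.5,
             onset=5, closed_loop=True):
    """Toy static-gain plant y = gain * u + d, with or without IMC feedback."""
    model_gain = gain  # assumption: the internal model matches the plant exactly
    d_hat = 0.0        # disturbance estimate carried over from the previous step
    outputs = []
    for t in range(steps):
        # A constant external disturbance switches on at step `onset`.
        d = disturbance if t >= onset else 0.0
        # Controller: invert the internal model on the corrected reference.
        target = reference - d_hat if closed_loop else reference
        u = target / model_gain
        y = gain * u + d          # true plant response
        y_model = model_gain * u  # internal model prediction
        d_hat = y - y_model       # mismatch is attributed to the disturbance
        outputs.append(y)
    return outputs


imc = simulate(closed_loop=True)
open_loop = simulate(closed_loop=False)
```

In this toy setting the closed-loop output deviates from the reference only at the step where the disturbance first appears and returns to the reference on the next step, while the open-loop output keeps a constant offset equal to the disturbance.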

Videos

GEVRM: Goal-Expressive Video Generation Model For Robust Visual Manipulation · SlidesLive

Taxonomy

Topics: Advanced Vision and Imaging · Visual Attention and Saliency Detection · Reinforcement Learning in Robotics