TL;DR
Goal-VLA is a zero-shot manipulation framework that uses Image-Generative VLMs as world models to generate goal states, enabling generalizable robot manipulation without explicit action annotations.
Contribution
It leverages Image-Generative VLMs for goal state generation and introduces a Reflection-through-Synthesis process to enhance robustness in manipulation tasks.
Findings
Achieves strong zero-shot manipulation performance in both simulation and real-world settings.
Uses object state representation as a key interface for high-level and low-level policy separation.
Demonstrates generalizability across diverse manipulation scenarios.
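To make the object-state interface in the findings above concrete, here is a minimal sketch of how a low-level controller could consume a goal object pose. It is not the paper's implementation: the function names (`make_pose`, `object_motion`, `gripper_target`), the pose values, and the assumption of a rigid grasp are all hypothetical, and the poses simply stand in for whatever a high-level module would estimate from the current image and the generated goal image.

```python
import numpy as np

def make_pose(rotation_z_rad, translation):
    """Build a 4x4 homogeneous transform: rotation about z plus a translation."""
    c, s = np.cos(rotation_z_rad), np.sin(rotation_z_rad)
    pose = np.eye(4)
    pose[:3, :3] = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    pose[:3, 3] = translation
    return pose

def object_motion(current_object_pose, goal_object_pose):
    """World-frame transform that moves the object from its current pose to the goal."""
    return goal_object_pose @ np.linalg.inv(current_object_pose)

def gripper_target(current_gripper_pose, current_object_pose, goal_object_pose):
    """Gripper pose that carries a rigidly grasped object to its goal pose (assumed rigid grasp)."""
    return object_motion(current_object_pose, goal_object_pose) @ current_gripper_pose

# Hypothetical poses standing in for estimates from the current and goal images.
current_object = make_pose(0.0, [0.1, 0.0, 0.0])
goal_object = make_pose(np.pi / 2, [0.3, 0.2, 0.0])
current_gripper = make_pose(0.0, [0.1, 0.0, 0.05])

target_gripper = gripper_target(current_gripper, current_object, goal_object)
```

Under these assumptions the low-level side never sees language or action labels, only a pair of object poses, which is the kind of separation the interface is meant to provide.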
Abstract
Generalization remains a fundamental challenge in robotic manipulation. To tackle this challenge, recent Vision-Language-Action (VLA) models build policies on top of Vision-Language Models (VLMs), seeking to transfer their open-world semantic knowledge. However, their zero-shot capability lags significantly behind the base VLMs, as the instruction-vision-action data is too limited to cover diverse scenarios, tasks, and robot embodiments. In this work, we present Goal-VLA, a zero-shot framework that leverages Image-Generative VLMs as world models to generate desired goal states, from which the target object pose is derived to enable generalizable manipulation. The key insight is that object state representation is the golden interface, naturally separating a manipulation system into high-level and low-level policies. This representation abstracts away explicit action annotations,…
