Goal-VLA: Image-Generative VLMs as Object-Centric World Models Empowering Zero-shot Robot Manipulation

Haonan Chen; Jingxiang Guo; Bangjun Wang; Tianrui Zhang; Xuchuan Huang; Boren Zheng; Yiwen Hou; Chenrui Tie; Jiajun Deng; Lin Shao

arXiv:2506.23919·cs.RO·March 31, 2026

TL;DR

Goal-VLA introduces a zero-shot manipulation framework using Image-Generative VLMs as world models to generate goal states, enabling generalizable robot manipulation without explicit action annotations.

Contribution

Goal-VLA leverages Image-Generative VLMs for goal state generation and introduces a Reflection-through-Synthesis process to improve robustness in manipulation tasks.

Findings

1. Achieves strong zero-shot manipulation performance both in simulation and in the real world.

2. Uses object state representation as the key interface separating high-level and low-level policies.

3. Demonstrates generalizability across diverse manipulation scenarios.

Abstract

Generalization remains a fundamental challenge in robotic manipulation. To tackle this challenge, recent Vision-Language-Action (VLA) models build policies on top of Vision-Language Models (VLMs), seeking to transfer their open-world semantic knowledge. However, their zero-shot capability lags significantly behind the base VLMs, as the instruction-vision-action data is too limited to cover diverse scenarios, tasks, and robot embodiments. In this work, we present Goal-VLA, a zero-shot framework that leverages Image-Generative VLMs as world models to generate desired goal states, from which the target object pose is derived to enable generalizable manipulation. The key insight is that object state representation is the golden interface, naturally separating a manipulation system into high-level and low-level policies. This representation abstracts away explicit action annotations,…
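
To make the idea of an object-state interface concrete, here is a minimal sketch of how a target object pose could be handed from a high-level goal generator to a low-level controller. It is not the authors' implementation: the `ObjectGoal` class, the helper names, the rigid-grasp assumption, and the example poses are all hypothetical and chosen only for illustration. The sketch assumes poses are 4x4 homogeneous transforms in a shared world frame.

```python
import math
from dataclasses import dataclass
from typing import List

Matrix = List[List[float]]  # 4x4 homogeneous transform, row-major


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two 4x4 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def invert_rigid(t: Matrix) -> Matrix:
    """Invert a rigid transform [R | p] as [R^T | -R^T p]."""
    rt = [[t[j][i] for j in range(3)] for i in range(3)]
    p = [t[i][3] for i in range(3)]
    new_p = [-sum(rt[i][k] * p[k] for k in range(3)) for i in range(3)]
    return [rt[i] + [new_p[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def pose_from_yaw(x: float, y: float, z: float, yaw: float) -> Matrix:
    """Build a pose with a rotation about the world z-axis (illustrative helper)."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, 0.0, x],
            [s, c, 0.0, y],
            [0.0, 0.0, 1.0, z],
            [0.0, 0.0, 0.0, 1.0]]


@dataclass
class ObjectGoal:
    """Hypothetical interface object: what the high level hands to the low level."""
    current_pose: Matrix  # object pose observed in the current scene
    target_pose: Matrix   # object pose derived from the generated goal state

    def world_delta(self) -> Matrix:
        """Return D such that D @ current_pose == target_pose."""
        return matmul(self.target_pose, invert_rigid(self.current_pose))


def gripper_goal(goal: ObjectGoal, gripper_pose: Matrix) -> Matrix:
    """Assuming a rigid grasp, the gripper must undergo the same world-frame motion."""
    return matmul(goal.world_delta(), gripper_pose)


# Example with made-up numbers: move an object 0.2 m along y and rotate it 90 degrees.
current = pose_from_yaw(0.4, 0.0, 0.02, 0.0)
target = pose_from_yaw(0.4, 0.2, 0.02, math.pi / 2)
gripper = pose_from_yaw(0.4, 0.0, 0.12, 0.0)  # grasping 0.10 m above the object
gripper_target = gripper_goal(ObjectGoal(current, target), gripper)
print([round(gripper_target[i][3], 3) for i in range(3)])  # approximately [0.4, 0.2, 0.12]
```

Under these assumptions the low-level side never sees an action label, only a current and a target object pose, which is one way to read the separation between high-level and low-level policies described in the abstract.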

Code & Models

Repositories

https://nus-lins-lab.github.io/goalvlaweb
