V-CAGE: Vision-Closed-Loop Agentic Generation Engine for Robotic Manipulation

Yaru Liu; Ao-bo Wang; Nanyang Ye

arXiv:2604.09036 · cs.RO · April 13, 2026

TL;DR

V-CAGE is an autonomous framework that synthesizes high-quality, semantically rich robotic manipulation datasets by combining scene construction, visual verification, and efficient compression, enabling scalable data generation.

Contribution

The paper introduces an embodied agentic system that automates scene generation, verification, and compression for robotic manipulation datasets, improving the semantic coherence and physical feasibility of the generated scenes.
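The generate-then-verify idea behind a closed-loop pipeline can be illustrated with a minimal sketch. This is not the paper's implementation: `Scene`, `closed_loop_generate`, `toy_propose`, `toy_verify`, and `ARM_REACH_M` are hypothetical names, and the reach-radius check is a toy stand-in for whatever visual verification V-CAGE actually performs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass
class Scene:
    """Hypothetical scene record: object name -> (x, y) position in metres."""
    objects: Dict[str, Tuple[float, float]]


def closed_loop_generate(
    propose: Callable[[int], Scene],
    verify: Callable[[Scene], bool],
    max_attempts: int = 5,
) -> Optional[Scene]:
    """Propose scenes until one passes verification or attempts run out."""
    for attempt in range(max_attempts):
        scene = propose(attempt)
        if verify(scene):
            return scene  # accepted: this scene can be used for data collection
    return None  # every proposal was rejected


# Toy stand-ins (assumptions for illustration, not the paper's components).
ARM_REACH_M = 0.8  # hypothetical reach radius, robot base at the origin


def toy_propose(attempt: int) -> Scene:
    # Each retry places the mug 0.5 m closer to the robot base.
    return Scene(objects={"mug": (1.5 - 0.5 * attempt, 0.0)})


def toy_verify(scene: Scene) -> bool:
    # Accept only if every object lies within the reach radius.
    return all(
        (x * x + y * y) ** 0.5 <= ARM_REACH_M
        for x, y in scene.objects.values()
    )


result = closed_loop_generate(toy_propose, toy_verify)
```

In this toy run the first two proposals place the mug out of reach and are rejected, and the third is accepted. In a real system the `verify` callable would be replaced by a learned or simulated check, and a rejected scene would typically feed information back into the next proposal.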

Findings

01. Reduces dataset file size by over 90% without loss of training quality.

02. Ensures generated scenes are semantically structured and kinematically reachable.

03. Automates end-to-end dataset synthesis for robotic manipulation.
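To make the file-size figure concrete, a reduction above 90% means the compressed data occupies less than one tenth of the original size, that is, a compression ratio above 10x. The short sketch below shows the arithmetic; the function names and the byte counts are hypothetical and are not taken from the paper.

```python
def size_reduction(original_bytes: int, compressed_bytes: int) -> float:
    """Fraction of the original size that was removed (0.9 means 90%)."""
    if original_bytes <= 0:
        raise ValueError("original_bytes must be positive")
    return 1.0 - compressed_bytes / original_bytes


def compression_ratio(original_bytes: int, compressed_bytes: int) -> float:
    """How many times smaller the compressed data is than the original."""
    if compressed_bytes <= 0:
        raise ValueError("compressed_bytes must be positive")
    return original_bytes / compressed_bytes


# Hypothetical example: a 1000-byte episode stored in 80 bytes.
reduction = size_reduction(1000, 80)   # about 0.92, i.e. a 92% reduction
ratio = compression_ratio(1000, 80)    # 12.5x smaller
```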

Abstract

Scaling Vision-Language-Action (VLA) models requires massive datasets that are both semantically coherent and physically feasible. However, existing scene generation methods often lack context-awareness, which makes it difficult to synthesize high-fidelity environments with rich semantic information and frequently results in unreachable target positions that cause tasks to fail prematurely. We present V-CAGE (Vision-Closed-Loop Agentic Generation Engine), an agentic framework for autonomous robotic data synthesis. Unlike traditional scripted pipelines, V-CAGE operates as an embodied agentic system, leveraging foundation models to bridge high-level semantic reasoning with low-level physical interaction. Specifically, we introduce Inpainting-Guided Scene Construction to systematically arrange context-aware layouts, ensuring that the generated scenes are both semantically structured and…
