V-CAGE: Vision-Closed-Loop Agentic Generation Engine for Robotic Manipulation
Yaru Liu, Ao-bo Wang, Nanyang Ye

TL;DR
V-CAGE is an autonomous framework that synthesizes high-quality, semantically rich robotic manipulation datasets by combining scene construction, visual verification, and efficient compression, enabling scalable data generation.
Contribution
The paper introduces an embodied agentic system that automates scene generation, verification, and compression for robotic datasets, improving semantic coherence and physical feasibility.
Findings
Achieves over 90% file-size reduction without degrading training quality.
Ensures scenes are semantically structured and kinematically reachable.
Automates end-to-end dataset synthesis for robotic manipulation.
Abstract
Scaling Vision-Language-Action (VLA) models requires massive datasets that are both semantically coherent and physically feasible. However, existing scene generation methods often lack context-awareness, which makes it difficult to synthesize high-fidelity environments with rich semantic information and frequently yields unreachable target positions that cause tasks to fail prematurely. We present V-CAGE (Vision-Closed-Loop Agentic Generation Engine), an agentic framework for autonomous robotic data synthesis. Unlike traditional scripted pipelines, V-CAGE operates as an embodied agentic system, leveraging foundation models to bridge high-level semantic reasoning with low-level physical interaction. Specifically, we introduce Inpainting-Guided Scene Construction to systematically arrange context-aware layouts, ensuring that the generated scenes are both semantically structured and…
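The closed-loop idea in the abstract (propose a scene, verify it, keep only scenes that pass) can be illustrated with a minimal sketch. This is not the authors' implementation: every name here (Scene, propose_scene, is_reachable, generate_verified_scenes, ARM_REACH_M) is hypothetical, the random sampler stands in for the foundation-model scene generator, and a simple distance check stands in for V-CAGE's visual verification.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Scene:
    # Target position on the table plane, in metres (hypothetical layout).
    target_xy: Tuple[float, float]


# Assumed maximum reach of an arm whose base sits at the origin.
ARM_REACH_M = 0.8


def propose_scene(rng: random.Random) -> Scene:
    """Stand-in for a scene generator: samples a random target position."""
    return Scene(target_xy=(rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)))


def is_reachable(scene: Scene) -> bool:
    """Stand-in verifier: accept targets within the assumed reach radius."""
    x, y = scene.target_xy
    return (x * x + y * y) ** 0.5 <= ARM_REACH_M


def generate_verified_scenes(
    n: int, seed: int = 0, max_attempts: int = 1000
) -> Tuple[List[Scene], int]:
    """Propose scenes until n pass verification or the attempt budget runs out."""
    rng = random.Random(seed)
    accepted: List[Scene] = []
    attempts = 0
    while len(accepted) < n and attempts < max_attempts:
        attempts += 1
        scene = propose_scene(rng)
        if is_reachable(scene):  # rejected scenes are simply discarded
            accepted.append(scene)
    return accepted, attempts


if __name__ == "__main__":
    scenes, attempts = generate_verified_scenes(5)
    print(f"accepted {len(scenes)} scenes in {attempts} attempts")
```

The point of the sketch is only the control flow: verification sits inside the generation loop, so infeasible scenes are filtered before any data is recorded rather than failing later at task execution.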
