V-CAGE: Context-Aware Generation and Verification for Scalable Long-Horizon Embodied Tasks
Yaru Liu, Ao-bo Wang, Nanyang Ye

TL;DR
V-CAGE is a framework that generates physically plausible, semantically aligned datasets for long-horizon embodied tasks. It combines geometric consistency checks during scene synthesis, hierarchical instruction decomposition, and semantic verification, and the resulting data improves policy success and generalization.
Contribution
The paper introduces V-CAGE, a closed-loop system that combines geometric consistency, hierarchical task decomposition, and vision-language model verification to produce high-fidelity embodied task datasets at scale.
Findings
Datasets generated by V-CAGE have higher physical and semantic fidelity.
V-CAGE significantly improves downstream policy success rates.
Semantic verification reduces silent failures in task execution.
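A silent failure here means an episode whose generated program reports success even though the task semantics were not met. The sketch below illustrates the filtering idea only: an episode is kept when both the program's own success flag and an independent semantic check agree. The `Episode` fields, `filter_episodes`, and `toy_verifier` are hypothetical names introduced for illustration, and the rule-based check is a stand-in for the model-based verifier the paper describes, not its actual implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Episode:
    """Hypothetical record of one generated trajectory."""
    instruction: str
    program_succeeded: bool  # what the generated program reports
    final_state: Dict[str, str] = field(default_factory=dict)  # object -> support surface


def filter_episodes(
    episodes: List[Episode],
    verifier: Callable[[Episode], bool],
) -> List[Episode]:
    """Keep only episodes that pass both the program check and the semantic check."""
    return [ep for ep in episodes if ep.program_succeeded and verifier(ep)]


def toy_verifier(ep: Episode) -> bool:
    """Rule-based stand-in for a semantic verifier (assumption, not the paper's model)."""
    return ep.final_state.get("cup") == "plate"


episodes = [
    Episode("put the cup on the plate", True, {"cup": "plate"}),   # real success
    Episode("put the cup on the plate", True, {"cup": "table"}),   # silent failure
    Episode("put the cup on the plate", False, {"cup": "table"}),  # reported failure
]
kept = filter_episodes(episodes, toy_verifier)
```

Only the first episode survives: the second is the silent-failure case that a check on the program's success flag alone would have admitted into the dataset.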
Abstract
Learning long-horizon embodied behaviors from synthetic data remains challenging because generated scenes are often physically implausible, language-driven programs frequently "succeed" without satisfying task semantics, and high-level instructions require grounding into executable action sequences. To address these limitations, we introduce V-CAGE, a closed-loop framework for generating robust, semantically aligned manipulation datasets at scale. First, we propose a context-aware instantiation mechanism that enforces geometric consistency during scene synthesis. By dynamically maintaining a map of prohibited spatial areas as objects are placed, our system prevents interpenetration and ensures reachable, conflict-free configurations in cluttered environments. Second, to bridge the gap between abstract intent and low-level control, we employ a hierarchical instruction decomposition…
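The context-aware instantiation step described in the abstract, where a map of prohibited areas grows as objects are placed, can be illustrated with a minimal 2D sketch. Everything below is an assumption made for illustration: objects are reduced to axis-aligned rectangles on a tabletop, poses are found by rejection sampling, and the names `Rect`, `place_objects`, and `margin` are hypothetical. The paper's actual representation and sampling strategy are not given in this summary.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Rect:
    """Axis-aligned footprint: lower-left corner (x, y), width w, height h."""
    x: float
    y: float
    w: float
    h: float

    def overlaps(self, other: "Rect") -> bool:
        # True only if the interiors intersect; shared edges do not count.
        return (
            self.x < other.x + other.w
            and other.x < self.x + self.w
            and self.y < other.y + other.h
            and other.y < self.y + self.h
        )


def place_objects(
    sizes: Sequence[Tuple[float, float]],
    table_w: float,
    table_h: float,
    margin: float = 0.01,
    max_tries: int = 200,
    seed: int = 0,
) -> List[Rect]:
    """Place footprints one by one, rejecting poses that hit a prohibited area."""
    rng = random.Random(seed)
    prohibited: List[Rect] = []  # grows with every accepted placement
    placements: List[Rect] = []
    for w, h in sizes:
        for _ in range(max_tries):
            cand = Rect(rng.uniform(0, table_w - w), rng.uniform(0, table_h - h), w, h)
            if not any(cand.overlaps(p) for p in prohibited):
                placements.append(cand)
                # Inflate the footprint so later objects keep a clearance margin.
                prohibited.append(
                    Rect(cand.x - margin, cand.y - margin, w + 2 * margin, h + 2 * margin)
                )
                break
        else:
            raise RuntimeError(f"no conflict-free pose found for object of size {(w, h)}")
    return placements


placed = place_objects([(0.1, 0.1)] * 5, table_w=1.0, table_h=1.0)
```

Because each candidate is tested against the inflated footprints of everything placed before it, accepted objects never interpenetrate. This sketch covers only the collision constraint; the reachability constraint mentioned in the abstract would need an additional check, which is not modeled here.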
Taxonomy
Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Generative Adversarial Networks and Image Synthesis
