V-CAGE: Context-Aware Generation and Verification for Scalable Long-Horizon Embodied Tasks

Yaru Liu; Ao-bo Wang; Nanyang Ye

arXiv:2601.15164 · cs.RO · January 22, 2026

TL;DR

V-CAGE is a framework that generates physically plausible, semantically aligned datasets for long-horizon embodied tasks by enforcing geometric consistency, hierarchical instruction decomposition, and semantic verification, improving policy success and generalization.

Contribution

The paper introduces V-CAGE, a closed-loop system that combines geometric consistency enforcement, hierarchical task decomposition, and vision-language model verification to generate scalable, high-fidelity datasets for embodied tasks.

Findings

01. Datasets generated by V-CAGE have higher physical and semantic fidelity.

02. V-CAGE significantly improves downstream policy success rates.

03. Semantic verification reduces silent failures in task execution.

Abstract

Learning long-horizon embodied behaviors from synthetic data remains challenging because generated scenes are often physically implausible, language-driven programs frequently "succeed" without satisfying task semantics, and high-level instructions require grounding into executable action sequences. To address these limitations, we introduce V-CAGE, a closed-loop framework for generating robust, semantically aligned manipulation datasets at scale. First, we propose a context-aware instantiation mechanism that enforces geometric consistency during scene synthesis. By dynamically maintaining a map of prohibited spatial areas as objects are placed, our system prevents interpenetration and ensures reachable, conflict-free configurations in cluttered environments. Second, to bridge the gap between abstract intent and low-level control, we employ a hierarchical instruction decomposition…
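The prohibited-area idea in the abstract can be illustrated with a small sketch. The abstract does not describe the actual implementation, so everything here is an assumption made for illustration: a 2D grid workspace, axis-aligned rectangular footprints, rejection sampling, and the hypothetical names ProhibitedMap, place_objects, and margin.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ProhibitedMap:
    """Hypothetical 2D grid that records cells no new object may occupy."""
    width: int
    height: int
    cells: set = field(default_factory=set)

    def is_free(self, x, y, w, h):
        # Reject footprints that leave the workspace.
        if x < 0 or y < 0 or x + w > self.width or y + h > self.height:
            return False
        # Reject footprints that touch any prohibited cell.
        return all((i, j) not in self.cells
                   for i in range(x, x + w)
                   for j in range(y, y + h))

    def mark(self, x, y, w, h, margin=0):
        # Prohibit the footprint plus an optional clearance margin around it.
        for i in range(x - margin, x + w + margin):
            for j in range(y - margin, y + h + margin):
                self.cells.add((i, j))


def place_objects(sizes, width, height, seed=0, max_tries=200, margin=1):
    """Place rectangles one by one, updating the prohibited map after each.

    Returns a list with one (x, y, w, h) tuple per object, or None for an
    object that could not be placed within max_tries samples.
    """
    rng = random.Random(seed)
    pmap = ProhibitedMap(width, height)
    placements = []
    for w, h in sizes:
        placed = None
        if w <= width and h <= height:
            for _ in range(max_tries):
                x = rng.randint(0, width - w)
                y = rng.randint(0, height - h)
                if pmap.is_free(x, y, w, h):
                    pmap.mark(x, y, w, h, margin=margin)
                    placed = (x, y, w, h)
                    break
        placements.append(placed)
    return placements


if __name__ == "__main__":
    for p in place_objects([(2, 2), (3, 1), (1, 4), (2, 3)], 10, 10):
        print(p)
```

Because the map is updated immediately after each placement, later objects are sampled only against the current set of prohibited cells, which is what rules out interpenetration in this toy setting. A real system would presumably work with 3D geometry and reachability constraints, which this sketch does not model.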

Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Generative Adversarial Networks and Image Synthesis