Hierarchical Vision Language Action Model Using Success and Failure Demonstrations

Jeongeun Park; Jihwan Yoon; Byungwoo Jeon; Juhan Park; Jinwoo Shin; Namhoon Cho; Kyungjae Lee; Sangdoo Yun; Sungjoon Choi

arXiv:2512.03913·cs.RO·December 4, 2025

Open Access

TL;DR

This paper introduces VINE, a hierarchical vision-language-action model that learns from both successful and failed demonstrations, treating failure data as a structured learning signal to improve robustness and success rates in manipulation tasks.

Contribution

VINE is a novel hierarchical model that incorporates failure data into planning, enabling more robust decision-making in vision-language-action tasks.

Findings

01. VINE improves success rates across manipulation tasks.

02. Failure data enhances robustness and decision-making.

03. Hierarchical reasoning effectively utilizes mixed-quality datasets.

Abstract

Prior Vision-Language-Action (VLA) models are typically trained on teleoperated successful demonstrations, while discarding numerous failed attempts that occur naturally during data collection. However, these failures encode where and how policies can be fragile, information that can be exploited to improve robustness. We address this problem by leveraging mixed-quality datasets to learn failure-aware reasoning at planning time. We introduce VINE, a hierarchical vision-language-action model that separates high-level reasoning (System 2) from low-level control (System 1) under a hierarchical reinforcement learning formalism, making failures usable as a structured learning signal rather than noisy supervision. System 2 performs feasibility-guided tree search over a 2D scene-graph abstraction: it proposes subgoal transitions, predicts success probabilities from both successes and failures,…
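To make the idea of feasibility-guided search more concrete, here is a minimal Python sketch. It is not VINE's algorithm: the abstract is truncated, so the actual search procedure, scoring rule, and scene-graph format are not shown here. The sketch assumes a scene graph can be flattened to a set of (subject, relation, object) edges, and it swaps in a generic best-first search that prefers the plan with the highest product of per-step success probabilities. The names `feasibility_guided_search`, `toy_propose`, `toy_success_prob`, and the probability table are all hypothetical; in a real system the probabilities would come from a model trained on both successes and failures.

```python
import heapq
import math
from typing import Callable, FrozenSet, Iterable, List, Optional, Tuple

# Assumption: a 2D scene graph is flattened to (subject, relation, object) edges.
Edge = Tuple[str, str, str]
State = FrozenSet[Edge]


def feasibility_guided_search(
    start: State,
    is_goal: Callable[[State], bool],
    propose: Callable[[State], Iterable[Tuple[str, State]]],
    success_prob: Callable[[State, str, State], float],
    max_expansions: int = 1000,
) -> Optional[Tuple[List[str], float]]:
    """Return the plan with the highest product of step success probabilities."""
    # Path cost is the sum of -log(p), so the cheapest path is the most feasible.
    counter = 0  # tie-breaker so the heap never has to compare states
    frontier = [(0.0, counter, start, [])]
    best_cost = {start: 0.0}
    expansions = 0
    while frontier and expansions < max_expansions:
        cost, _, state, plan = heapq.heappop(frontier)
        if cost > best_cost.get(state, math.inf):
            continue  # stale entry: a cheaper path to this state was found
        if is_goal(state):
            return plan, math.exp(-cost)
        expansions += 1
        for label, nxt in propose(state):
            p = min(success_prob(state, label, nxt), 1.0)
            if p <= 0.0:
                continue  # prune transitions predicted to be infeasible
            new_cost = cost - math.log(p)
            if new_cost < best_cost.get(nxt, math.inf):
                best_cost[nxt] = new_cost
                counter += 1
                heapq.heappush(frontier, (new_cost, counter, nxt, plan + [label]))
    return None


# Hypothetical toy task: move a cup from the table to the shelf.
def toy_propose(state: State) -> Iterable[Tuple[str, State]]:
    if ("cup", "on", "table") in state:
        yield "cup->shelf", frozenset({("cup", "on", "shelf")})
        yield "cup->tray", frozenset({("cup", "on", "tray")})
    elif ("cup", "on", "tray") in state:
        yield "tray->shelf", frozenset({("cup", "on", "shelf")})


# Made-up probabilities standing in for a learned success predictor.
TOY_PROBS = {"cup->shelf": 0.3, "cup->tray": 0.9, "tray->shelf": 0.8}


def toy_success_prob(state: State, label: str, nxt: State) -> float:
    return TOY_PROBS[label]


plan, prob = feasibility_guided_search(
    frozenset({("cup", "on", "table")}),
    lambda s: ("cup", "on", "shelf") in s,
    toy_propose,
    toy_success_prob,
)
print(plan, round(prob, 2))
```

In this toy setup the direct move is judged risky (0.3), so the search prefers the two-step route through the tray (0.9 × 0.8 = 0.72). This illustrates how a success predictor can steer planning away from transitions that resemble past failures.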

Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Robotic Path Planning Algorithms