Grounded Vision-Language Interpreter for Integrated Task and Motion Planning

Jeremy Siburian; Keisuke Shirai; Cristian C. Beltran-Hernandez; Masashi Hamaya; Michael Görner; Atsushi Hashimoto

arXiv:2506.03270 · cs.RO · November 5, 2025

TL;DR

This paper introduces ViLaIn-TAMP, a hybrid planning framework combining vision-language interpretation with symbolic and geometric planning, enhancing safety, interpretability, and success rates in robot task execution.

Contribution

The paper presents a hybrid framework that integrates vision-language models with symbolic and geometric planning and adds a feedback loop for refinement, which improves the reliability of robot planning.

Findings

01. Outperforms baseline by 18% in success rate

02. Adding corrective planning boosts success by 32%

03. Effective in complex cooking manipulation tasks

Abstract

While recent advances in vision-language models have accelerated the development of language-guided robot planners, their black-box nature often lacks safety guarantees and interpretability crucial for real-world deployment. Conversely, classical symbolic planners offer rigorous safety verification but require significant expert knowledge for setup. To bridge the current gap, this paper proposes ViLaIn-TAMP, a hybrid planning framework for enabling verifiable, interpretable, and autonomous robot behaviors. ViLaIn-TAMP comprises three main components: (1) a Vision-Language Interpreter (ViLaIn) adapted from previous work that converts multimodal inputs into structured problem specifications, (2) a modular Task and Motion Planning (TAMP) system that grounds these specifications in actionable trajectory sequences through symbolic and geometric constraint reasoning, and (3) a corrective…
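The interplay between the interpreter, the planner, and the feedback loop for refinement mentioned under Contribution can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `PlanResult`, `plan_with_correction`, `toy_interpret`, and `toy_plan`, the Lisp-style goal strings, and the retry limit are hypothetical and are not taken from the paper or its code. The two toy functions merely stand in for the vision-language interpreter and the TAMP system.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PlanResult:
    success: bool
    trajectory: list[str]
    error: Optional[str] = None  # planner feedback when planning fails


def plan_with_correction(
    interpret: Callable[[str, Optional[str]], str],
    plan: Callable[[str], PlanResult],
    instruction: str,
    max_attempts: int = 3,
) -> tuple[Optional[PlanResult], int]:
    """Interpret, plan, and re-interpret with planner feedback until success."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        spec = interpret(instruction, feedback)  # hypothetical interpreter call
        result = plan(spec)                      # hypothetical planner call
        if result.success:
            return result, attempt
        feedback = result.error  # pass the failure reason to the next attempt
    return None, max_attempts


# Toy stand-ins (assumptions): the first specification omits a precondition,
# and the second one is repaired using the planner's error message.
def toy_interpret(instruction: str, feedback: Optional[str]) -> str:
    if feedback is None:
        return "(:goal (sliced cucumber))"
    return "(:goal (and (on cucumber board) (sliced cucumber)))"


def toy_plan(spec: str) -> PlanResult:
    if "(on cucumber board)" not in spec:
        return PlanResult(False, [], "cucumber must be on the board before slicing")
    return PlanResult(True, ["pick cucumber", "place cucumber board", "slice cucumber"])


result, attempts = plan_with_correction(toy_interpret, toy_plan, "slice the cucumber")
print(attempts, result.trajectory)
```

In this toy run the first specification fails, the planner's error message is fed back to the interpreter, and the second attempt succeeds. Passing the interpreter and planner in as callables keeps the loop independent of any particular model or planner.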
