Multimodal Behavior Tree Generation: A Small Vision-Language Model for Robot Task Planning

Cristiano Battistini, Riccardo Andrea Izzo, Gianluca Bardaro, and Matteo Matteucci

arXiv:2603.06084 · cs.RO · March 9, 2026

Open Access

TL;DR

This paper introduces a multimodal vision-language model that generates behavior trees for robot task planning, leveraging a new dataset and fine-tuning techniques to achieve high success rates with reduced computational costs.

Contribution

It presents a novel dataset linking visual observations and instructions to behavior trees and fine-tunes a compact VLM to perform robotic task planning effectively.

Findings

01. Achieves 87% success rate in household tasks

02. Approaches performance of state-of-the-art models

03. Uses significantly fewer computational resources

Abstract

Large and small language models have been widely used for robotic task planning. At the same time, vision-language models (VLMs) have successfully tackled problems such as image captioning, scene understanding, and visual question answering. In this work, we combine these two approaches by deploying a compact, open-source multimodal model to generate behavior trees for robotic task planning. The main obstacle to achieving this goal is the lack of an existing dataset that links visual observations and instructions to executable behavior trees. We propose a method to construct such a dataset starting from existing robotic episodes (i.e., Open X-Embodiment), in which a large model serves as a teacher in a multi-stage generation pipeline. We use this dataset to fine-tune VLMs ranging from 500M to 4B parameters via parameter-efficient fine-tuning (PEFT). The generated behavior trees,…
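
For readers unfamiliar with the term, the sketch below illustrates what an executable behavior tree is in general. It is a minimal, generic example and not the representation or output format used by the authors. The node classes, the `step` helper, and the task names (`locate_cup`, `grasp_cup_top`, and so on) are hypothetical, and the usual RUNNING status is omitted for brevity.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List


class Status(Enum):
    SUCCESS = "success"
    FAILURE = "failure"


@dataclass
class Action:
    """Leaf node: runs a callable that reports success (True) or failure (False)."""
    name: str
    run: Callable[[], bool]

    def tick(self) -> Status:
        return Status.SUCCESS if self.run() else Status.FAILURE


@dataclass
class Sequence:
    """Composite node: ticks children in order and fails at the first failure."""
    children: List = field(default_factory=list)

    def tick(self) -> Status:
        for child in self.children:
            if child.tick() is Status.FAILURE:
                return Status.FAILURE
        return Status.SUCCESS


@dataclass
class Fallback:
    """Composite node: ticks children in order and succeeds at the first success."""
    children: List = field(default_factory=list)

    def tick(self) -> Status:
        for child in self.children:
            if child.tick() is Status.SUCCESS:
                return Status.SUCCESS
        return Status.FAILURE


# Records the order in which leaf actions are executed.
log: List[str] = []


def step(name: str, ok: bool = True) -> Action:
    """Hypothetical stand-in for a robot skill; it only logs its name."""
    def run() -> bool:
        log.append(name)
        return ok
    return Action(name, run)


# Hypothetical household task: try a top grasp, fall back to a side grasp.
tree = Sequence([
    step("locate_cup"),
    Fallback([step("grasp_cup_top", ok=False), step("grasp_cup_side")]),
    step("place_cup_on_shelf"),
])
result = tree.tick()
```

Here the top grasp fails, the fallback node tries the side grasp, and the sequence then continues to the placement step, so the tree as a whole succeeds. A planner that emits trees of this kind only needs to produce the structure and the leaf names; the execution semantics live in the node types.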

Taxonomy

Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Robot Manipulation and Learning