Learning Structured Robot Policies from Vision-Language Models via Synthetic Neuro-Symbolic Supervision

Alessandro Adami; Tommaso Tubaldo; Marco Todescato; Ruggero Carli; Pietro Falco

arXiv:2604.02812·cs.RO·May 18, 2026

TL;DR

This paper presents a neuro-symbolic method in which a large vision-language model generates interpretable, structured robot policies from visual and language inputs. The model is trained on synthetic data, and the resulting policies transfer successfully to real robots.

Contribution

The paper introduces a scalable pipeline for generating synthetic multimodal datasets and demonstrates that a 12B-parameter model can learn structured policies for robotic manipulation.

Findings

01. Structured policies achieve zero-shot transfer to real robots.

02. Synthetic data generation bypasses the data bottleneck in robotic planning.

03. Neuro-symbolic policies outperform end-to-end visuomotor approaches in interpretability.

Abstract

Vision-Language Models (VLMs) have recently demonstrated strong capabilities in mapping multimodal observations to robot behaviors. However, most current approaches rely on end-to-end visuomotor policies that remain opaque and difficult to analyze, limiting their use in real-world robotic applications. In contrast, classical robotic systems often rely on structured policy representations that provide interpretability, modularity, and reactive execution. This work investigates how foundation models can be specialized to generate structured robot policies grounded in multimodal perception, bridging high-dimensional learning and symbolic control. We propose a neuro-symbolic approach in which a VLM synthesizes executable Behavior Tree policies from visual observations, natural language instructions, and structured system specifications. To enable scalable supervision without manual…
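
For readers unfamiliar with Behavior Trees, the structured policy representation named in the abstract, the following is a minimal sketch of how such a tree is built and ticked. It is not the paper's policy format or execution engine: the node classes, the `is_holding`, `grasp`, and `place` leaves, and the dictionary-based world state are all hypothetical names chosen for illustration. It only shows the standard Sequence and Fallback semantics that make these policies readable and reactive.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List


class Status(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    RUNNING = "running"


@dataclass
class Leaf:
    """Condition or action node wrapping a function of the world state."""
    name: str
    fn: Callable[[Dict], Status]

    def tick(self, state: Dict) -> Status:
        return self.fn(state)


@dataclass
class Sequence:
    """Ticks children in order; returns the first non-SUCCESS status."""
    children: List = field(default_factory=list)

    def tick(self, state: Dict) -> Status:
        for child in self.children:
            status = child.tick(state)
            if status is not Status.SUCCESS:
                return status
        return Status.SUCCESS


@dataclass
class Fallback:
    """Ticks children in order; returns the first non-FAILURE status."""
    children: List = field(default_factory=list)

    def tick(self, state: Dict) -> Status:
        for child in self.children:
            status = child.tick(state)
            if status is not Status.FAILURE:
                return status
        return Status.FAILURE


# Hypothetical leaves for a toy pick-and-place task (illustration only).
def is_holding(state: Dict) -> Status:
    return Status.SUCCESS if state["holding"] else Status.FAILURE


def grasp(state: Dict) -> Status:
    state["holding"] = True
    return Status.SUCCESS


def place(state: Dict) -> Status:
    state["holding"] = False
    state["placed"] = True
    return Status.SUCCESS


# "Make sure the object is held (grasp it if not), then place it."
tree = Sequence([
    Fallback([Leaf("is_holding", is_holding), Leaf("grasp", grasp)]),
    Leaf("place", place),
])

state = {"holding": False, "placed": False}
result = tree.tick(state)
print(result, state)
```

The Fallback node expresses the reactive pattern "check a condition, act only if it does not hold", which is one reason tree-structured policies are easier to inspect than an end-to-end visuomotor network: each branch can be read and tested on its own.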
