SYMBOLIZER: Symbolic Model-free Task Planning with VLMs

Sami Azirar; Zlatan Ajanovic; Hermann Blum

arXiv:2604.17830·cs.RO·April 21, 2026

TL;DR

This paper introduces SYMBOLIZER, a framework that combines Visual Language Models (VLMs) with symbolic planning to enable scalable, domain-independent task planning without handcrafted models, achieving state-of-the-art results on the ProDG and ViPlan benchmarks.

Contribution

The proposed method uses VLMs to ground symbolic states from images and performs heuristic search without action models, generalizing well to unseen problems.
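
To make "heuristic search without action models" concrete, here is a minimal sketch, not the authors' implementation. It assumes a symbolic state is a set of ground predicates and that successors come from a black-box function rather than from declared preconditions and effects. The names `greedy_best_first`, `goal_count`, and `toy_step` are hypothetical, and the choice of greedy best-first search with a goal-count heuristic is an assumption made for illustration. In a real system, `toy_step` would be replaced by whatever produces the next VLM-grounded state.

```python
import heapq
from itertools import count
from typing import Callable, FrozenSet, Iterable, List, Optional

State = FrozenSet[str]  # a symbolic state: a set of ground predicates


def goal_count(state: State, goal: State) -> int:
    """Heuristic (assumed): number of goal predicates not yet true."""
    return len(goal - state)


def greedy_best_first(
    start: State,
    goal: State,
    actions: Iterable[str],
    step: Callable[[State, str], Optional[State]],
) -> Optional[List[str]]:
    """Search using only a black-box `step`; no action model is consulted."""
    actions = list(actions)
    tie = count()  # tie-breaker so the heap never compares states
    frontier = [(goal_count(start, goal), next(tie), start, [])]
    seen = {start}
    while frontier:
        _, _, state, plan = heapq.heappop(frontier)
        if goal <= state:  # every goal predicate holds
            return plan
        for action in actions:
            nxt = step(state, action)  # None means the action had no effect
            if nxt is None or nxt in seen:
                continue
            seen.add(nxt)
            heapq.heappush(
                frontier,
                (goal_count(nxt, goal), next(tie), nxt, plan + [action]),
            )
    return None


def toy_step(state: State, action: str) -> Optional[State]:
    """Hypothetical stand-in for 'apply the action, then re-ground the state'."""
    if action == "pick(a)" and {"on(a,table)", "handempty"} <= state:
        return frozenset((state - {"on(a,table)", "handempty"}) | {"holding(a)"})
    if action == "stack(a,b)" and "holding(a)" in state:
        return frozenset((state - {"holding(a)"}) | {"on(a,b)", "handempty"})
    return None


start = frozenset({"on(a,table)", "on(b,table)", "handempty"})
goal = frozenset({"on(a,b)"})
plan = greedy_best_first(start, goal, ["stack(a,b)", "pick(a)"], toy_step)
print(plan)  # ['pick(a)', 'stack(a,b)']
```

The search only ever calls `step` and compares predicate sets, so nothing in it depends on a handcrafted domain description. That is the property the contribution statement emphasizes.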

Findings

1. Outperforms direct VLM-based planning in experiments.

2. Achieves state-of-the-art results on ProDG and ViPlan benchmarks.

3. Operates effectively across diverse domains with large state spaces.

Abstract

Traditional Task and Motion Planning (TAMP) systems depend on physics models for motion planning and discrete symbolic models for task planning. Although physics models are often available, symbolic models (consisting of symbolic state interpretation and action models) must be meticulously handcrafted or learned from labeled data. This process is both resource-intensive and constrains the solution to the specific domain, limiting scalability and adaptability. On the other hand, Visual Language Models (VLMs) show desirable zero-shot visual understanding (due to their extensive training on heterogeneous data), but still have limited planning capabilities. Therefore, integrating VLMs with classical planning for long-horizon reasoning in TAMP problems offers high potential. Recent works in this direction still lack generality and depend on handcrafted, task-specific solutions, e.g.…
