Multi-Agent Planning Using Visual Language Models

Michele Brienza; Francesco Argenziano; Vincenzo Suriani; Domenico D. Bloisi; Daniele Nardi

arXiv:2408.05478·cs.AI·December 31, 2024

Open Access

TL;DR

This paper introduces a multi-agent planning system based on visual language models that operates with minimal environment data by leveraging commonsense knowledge. It also presents a new automatic evaluation method, validated on the ALFRED dataset.

Contribution

Proposes a novel multi-agent architecture for embodied task planning that requires only a single environment image and introduces an automatic plan evaluation procedure.

Findings

01. The approach effectively handles free-form domains with minimal data.

02. The new evaluation method, PG2S, correlates well with plan quality.

03. PG2S was validated on the ALFRED dataset, where it outperforms existing metrics.

Abstract

Large Language Models (LLMs) and Visual Language Models (VLMs) are attracting increasing interest due to their improving performance and applications across various domains and tasks. However, LLMs and VLMs can produce erroneous results, especially when a deep understanding of the problem domain is required. For instance, when planning and perception are needed simultaneously, these models often struggle because of difficulties in merging multi-modal information. To address this issue, fine-tuned models are typically employed and trained on specialized data structures representing the environment. This approach has limited effectiveness, as it can overly complicate the context for processing. In this paper, we propose a multi-agent architecture for embodied task planning that operates without the need for specific data structures as input. Instead, it uses a single image of the…

Taxonomy

Topics: Text and Document Classification Technologies · Multimodal Machine Learning Applications