From An LLM Swarm To A PDDL-Empowered HIVE: Planning Self-Executed Instructions In A Multi-Modal Jungle

Kaustubh Vyas; Damien Graux; Yijun Yang; Sébastien Montella; Chenxin Diao; Wendi Zhou; Pavlos Vougiouklis; Ruofei Lai; Yang Ren; Keshuang Li; Jeff Z. Pan

arXiv:2412.12839·cs.AI·September 30, 2025

Open Access 3 Reviews

TL;DR

This paper presents Hive, a multi-modal, knowledge-aware planning system that uses LLMs and PDDL (Planning Domain Definition Language) to generate explainable, complex action sequences for real-world queries, outperforming existing systems in task selection accuracy.

Contribution

Introduces Hive, a novel multi-modal agent system leveraging LLMs and PDDL for explainable, constraint-aware planning of atomic actions across diverse models.

Findings

01. Hive outperforms existing systems in task selection accuracy.

02. The MuSE benchmark effectively evaluates multi-modal agent capabilities.

03. Hive guarantees explainability and user constraint adherence in complex tasks.

Abstract

In response to the call for agent-based solutions that leverage the ever-increasing capabilities of the deep models' ecosystem, we introduce Hive -- a comprehensive solution for knowledge-aware planning of a set of atomic actions to address input queries and subsequently selecting appropriate models accordingly. Hive operates over sets of models and, upon receiving natural language instructions (i.e., user queries), schedules and executes explainable plans of atomic actions. These actions can involve one or more of the available models to achieve the overall task, while respecting end users' specific constraints. Notably, Hive handles tasks that involve multi-modal inputs and outputs, enabling it to address complex, real-world queries. Our system is capable of planning complex chains of actions while guaranteeing explainability, using an LLM-based formal logic backbone empowered by PDDL…
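To make the idea of planning a chain of atomic actions concrete, here is a minimal sketch of a PDDL-style planner in Python. It is not Hive's implementation: the action names, the modality facts, and the `plan` function are all hypothetical, and a plain breadth-first search stands in for whatever planner the authors use. Each action declares preconditions and effects, which is what makes the resulting plan easy to explain step by step.

```python
from collections import deque
from typing import NamedTuple


class Action(NamedTuple):
    name: str        # hypothetical atomic action name
    pre: frozenset   # facts that must hold before the action can run
    add: frozenset   # facts the action makes true


# Hypothetical action catalogue (an assumption, not taken from the paper).
ACTIONS = [
    Action("speech_to_text", frozenset({"audio"}), frozenset({"text"})),
    Action("translate_text", frozenset({"text"}), frozenset({"text_translated"})),
    Action("text_to_image", frozenset({"text_translated"}), frozenset({"image"})),
    Action("image_caption", frozenset({"image"}), frozenset({"text"})),
]


def plan(initial, goal, actions=ACTIONS):
    """Breadth-first search for a shortest action sequence reaching the goal.

    Returns a list of action names, or None when the goal is unreachable.
    """
    start = frozenset(initial)
    goal = frozenset(goal)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, steps = queue.popleft()
        if goal <= state:  # every goal fact holds in this state
            return steps
        for act in actions:
            if act.pre <= state:  # preconditions satisfied
                nxt = state | act.add
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, steps + [act.name]))
    return None


if __name__ == "__main__":
    # Audio input, image output: the planner has to chain three actions.
    print(plan({"audio"}, {"image"}))
```

In this toy setting, a query with audio input and an image goal yields a three-step chain, while a goal that no action can produce returns `None`. A real system would additionally attach a concrete model to each action and encode user constraints as extra preconditions.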

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 5 · Confidence 3

Strengths

The KG construction method is thorough, utilizing model metadata from multiple sources to enhance graph quality. The entire planning pipeline demonstrates a clear improvement over prior work, such as HuggingGPT. Additionally, the writing is well-structured and easy to follow, which supports reader comprehension.

Weaknesses

This paper has several notable weaknesses: First, it feels like the approach relies heavily on leveraging a powerful LLM to decompose user tasks based on key attributes, selecting open-source models from Huggingface, and executing plans sequentially. In essence, the paper presents a tool-usage framework where the tools are Huggingface models. While there are some unique designs for processing and indexing these models, the contribution appears marginal. The problem-solving paradigm closely rese

Reviewer 02 · Rating 6 · Confidence 3

Strengths

Originality
- They claim to be the first multiple model approach to leverage PDDL.
- Their approach consists of intuitive steps, achieved using new ideas such as PDDL for planning and C-KG for model selection.
- The multi-modal benchmark data set appears to be a new contribution in the multiple model task domain. More clarity here would be great, see my comments below.

Quality
- The idea of using model cards systematically to leverage new models is very practical and useful. I can see this being h

Weaknesses

- PDDL was not defined until later in the paper - define it in the abstract and/or introduction.
- Not clear if MuSE will be openly available.
- The interaction between PDDL and LLM (ChatGPT) in task decomposition could be made clearer - what exactly is the LLM doing and how is it leveraging PDDL? I think since PDDL could be new for readers, being clear here is very important.

Reviewer 03 · Rating 5 · Confidence 3

Strengths

The paper proposes a method to automatically extract models covering various modalities from Hugging Face, and explains the methodology in detail. The system also simplifies user queries by automatically choosing the correct model from the pool according to the detailed requirements the user states in natural language, which is quite meaningful given the rapidly growing number of multi-modal foundation models in the community. Experiments show that even the light version beats the baselines by a wide margin except

Weaknesses

Although an automatic extraction system is designed, the number of final candidate models for HIVE is unknown. Appendix A mentions only a very small set of models, most of which fall well behind SOTA (e.g. in image generation, machine translation, text generation). (Also see questions.) The paper fails to explain why such a system is necessary, given that most recent foundation models are general-purpose within their specific modality or multi-modal domain, and users can easily choose a model

Taxonomy

Topics: Open Education and E-Learning · Online and Blended Learning

Methods: Sparse Evolutionary Training