From An LLM Swarm To A PDDL-Empowered HIVE: Planning Self-Executed Instructions In A Multi-Modal Jungle
Kaustubh Vyas, Damien Graux, Yijun Yang, Sébastien Montella, Chenxin Diao, Wendi Zhou, Pavlos Vougiouklis, Ruofei Lai, Yang Ren, Keshuang Li, Jeff Z. Pan

TL;DR
This paper presents Hive, a multi-modal, knowledge-aware planning system that uses LLMs and PDDL to generate explainable, complex action sequences for real-world queries, outperforming existing methods.
Contribution
Introduces Hive, a novel multi-modal agent system leveraging LLMs and PDDL for explainable, constraint-aware planning of atomic actions across diverse models.
Findings
Hive outperforms existing systems in task selection accuracy.
The MuSE benchmark effectively evaluates multi-modal agent capabilities.
Hive guarantees explainability and user constraint adherence in complex tasks.
Abstract
In response to the call for agent-based solutions that leverage the ever-increasing capabilities of the deep models' ecosystem, we introduce Hive -- a comprehensive solution for knowledge-aware planning of a set of atomic actions to address input queries and subsequently selecting appropriate models accordingly. Hive operates over sets of models and, upon receiving natural language instructions (i.e. user queries), schedules and executes explainable plans of atomic actions. These actions can involve one or more of the available models to achieve the overall task, while respecting end users' specific constraints. Notably, Hive handles tasks that involve multi-modal inputs and outputs, enabling it to handle complex, real-world queries. Our system is capable of planning complex chains of actions while guaranteeing explainability, using an LLM-based formal logic backbone empowered by PDDL…
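To make the idea of planning over atomic actions concrete, here is a minimal sketch in Python. It is not Hive's implementation and it does not use PDDL syntax or a PDDL solver; it only mimics the precondition/effect style of classical planning with a breadth-first search. Every name in it (the `Action` fields, the action and model labels, the `license_id` constraint) is a hypothetical placeholder chosen for illustration.

```python
from collections import deque
from typing import NamedTuple


class Action(NamedTuple):
    name: str          # hypothetical atomic action label
    model: str         # hypothetical model identifier
    needs: frozenset   # facts that must hold before the action runs
    adds: frozenset    # facts that hold after the action runs
    license_id: str    # hypothetical metadata used as a user constraint


def plan(initial, goal, actions, allowed_licenses):
    """Breadth-first search over sets of facts.

    Returns a shortest list of actions reaching the goal, or None.
    """
    # Enforce the user constraint by filtering actions up front.
    usable = [a for a in actions if a.license_id in allowed_licenses]
    start = frozenset(initial)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, steps = queue.popleft()
        if goal <= state:
            return steps
        for a in usable:
            if a.needs <= state:
                nxt = state | a.adds
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, steps + [a]))
    return None


# Illustrative action catalogue; none of these are real model names.
ACTIONS = [
    Action("transcribe", "asr-model", frozenset({"audio"}),
           frozenset({"text_en"}), "apache-2.0"),
    Action("translate", "mt-model", frozenset({"text_en"}),
           frozenset({"text_fr"}), "mit"),
    Action("translate_alt", "mt-model-nc", frozenset({"text_en"}),
           frozenset({"text_fr"}), "cc-by-nc"),
    Action("caption", "caption-model", frozenset({"image"}),
           frozenset({"text_en"}), "mit"),
]

steps = plan({"audio"}, frozenset({"text_fr"}), ACTIONS, {"apache-2.0", "mit"})
for i, a in enumerate(steps, 1):
    # Each step records which model was used and why it was applicable.
    print(f"{i}. {a.name} via {a.model}: {sorted(a.needs)} -> {sorted(a.adds)}")
```

Because each step carries its preconditions and effects, the resulting plan can be read back as an explanation of why each model was invoked, which is the property the abstract attributes to a formal planning backbone. In this toy version the constraint is a simple filter; a real planner could encode it as part of the problem definition instead.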
Peer Reviews
Decision: ICLR 2025 Poster
The KG construction method is thorough, utilizing model metadata from multiple sources to enhance graph quality. The entire planning pipeline demonstrates a clear improvement over prior work, such as HuggingGPT. Additionally, the writing is well-structured and easy to follow, which supports reader comprehension.
This paper has several notable weaknesses: First, it feels like the approach relies heavily on leveraging a powerful LLM to decompose user tasks based on key attributes, selecting open-source models from Huggingface, and executing plans sequentially. In essence, the paper presents a tool-usage framework where the tools are Huggingface models. While there are some unique designs for processing and indexing these models, the contribution appears marginal. The problem-solving paradigm closely rese
Originality
- They claim to be the first multiple model approach to leverage PDDL.
- Their approach consists of intuitive steps, achieved using new ideas such as PDDL for planning and C-KG for model selection.
- The multi-modal benchmark data set appears to be a new contribution in the multiple model task domain. More clarity here would be great, see my comments below.
Quality
- The idea of using model cards systematically to leverage new models is very practical and useful. I can see this being h
- PDDL was not defined until later in the paper - define it in the abstract and/or introduction.
- Not clear if MuSE will be openly available
- The interaction between PDDL and LLM (ChatGPT) in task decomposition could be made clearer - what exactly is the LLM doing and how is it leveraging PDDL? I think since PDDL could be new for readers, being clear here is very important.
The paper proposes a method to automatically extract models with capabilities across various modalities from Hugging Face, and explains the methodology in detail. The automatic system also eases user querying by automatically choosing the correct model from the pool according to the detailed requirements the user gives in natural language, which is quite meaningful given the rapidly growing set of multi-modal foundation models in the community. Experiments show that even the light version can beat the baselines by a wide margin except
Although an automatic extraction system is designed, the number of final candidates for HIVE is unknown. Appendix A only mentions a very small set of models, most of which fall well behind SOTA (e.g. in image generation, machine translation, text generation). (also see questions) The paper fails to explain why such a system is necessary, given that most recent foundation models are general-purpose within their specific modality or multi-modality domain, and users can easily choose a model
Taxonomy
Topics: Open Education and E-Learning · Online and Blended Learning
Methods: Sparse Evolutionary Training
