MaestroMotif: Skill Design from Artificial Intelligence Feedback
Martin Klissarov, Mikael Henaff, Roberta Raileanu, Shagun Sodhani, Pascal Vincent, Amy Zhang, Pierre-Luc Bacon, Doina Precup, Marlos C. Machado, Pierluca D'Oro

TL;DR
MaestroMotif is a novel AI-assisted skill design method that uses Large Language Models to automatically create, train, and combine skills from natural language descriptions, resulting in high-performing, adaptable agents.
Contribution
It introduces a new approach leveraging LLM feedback and code generation for skill design and reinforcement learning, enhancing AI agent performance and flexibility.
Findings
Outperforms existing methods in NetHack Learning Environment tasks.
Demonstrates high adaptability and usability in complex environments.
Shows effective reuse of skills generated from natural language descriptions.
Abstract
Describing skills in natural language has the potential to provide an accessible way to inject human knowledge about decision-making into an AI system. We present MaestroMotif, a method for AI-assisted skill design, which yields high-performing and adaptable agents. MaestroMotif leverages the capabilities of Large Language Models (LLMs) to effectively create and reuse skills. It first uses an LLM's feedback to automatically design rewards corresponding to each skill, starting from their natural language description. Then, it employs an LLM's code generation abilities, together with reinforcement learning, for training the skills and combining them to implement complex behaviors specified in language. We evaluate MaestroMotif using a suite of complex tasks in the NetHack Learning Environment (NLE), demonstrating that it surpasses existing approaches in both performance and usability.
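The two stages described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes that LLM feedback takes the form of pairwise preferences fitted with a Bradley-Terry style loss (a common choice for learning rewards from preferences), and that the code combining skills is a small function that picks one skill per step. The skill names come from the reviews on this page; the observation fields (`depth`, `target_depth`), the action strings, and every function name are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical observation type: a flat dict of game statistics.
Obs = Dict[str, float]


def preference_loss(r_a: float, r_b: float, label: int) -> float:
    """Bradley-Terry negative log-likelihood for one annotated pair.

    r_a and r_b are reward-model outputs for two observations.
    label is 0 if the annotator (assumed here to be an LLM) preferred
    the first observation and 1 if it preferred the second.
    """
    p_b = 1.0 / (1.0 + math.exp(-(r_b - r_a)))  # P(second is preferred)
    p = p_b if label == 1 else 1.0 - p_b
    return -math.log(p)


@dataclass
class Skill:
    name: str
    policy: Callable[[Obs], str]  # stand-in for an RL-trained skill policy


def select_skill(obs: Obs) -> str:
    """Hand-written stand-in for an LLM-generated policy over skills."""
    if obs["depth"] < obs["target_depth"]:
        return "Descender"
    if obs["depth"] > obs["target_depth"]:
        return "Ascender"
    return "Discoverer"


def act(obs: Obs, skills: Dict[str, Skill]) -> str:
    """Pick a skill for the current observation and run its policy."""
    return skills[select_skill(obs)].policy(obs)


# Toy skill policies that return fixed, made-up action strings.
SKILLS = {
    "Descender": Skill("Descender", lambda obs: "go_down"),
    "Ascender": Skill("Ascender", lambda obs: "go_up"),
    "Discoverer": Skill("Discoverer", lambda obs: "explore"),
}

if __name__ == "__main__":
    print(act({"depth": 1, "target_depth": 3}, SKILLS))
    print(preference_loss(0.0, 2.0, label=1))
```

In this sketch the reward model and the skill policies are reduced to plain functions so that the control flow stays visible. In a real system each skill policy would be a network trained with reinforcement learning against its own learned reward.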
Peer Reviews
Decision·ICLR 2025 Oral
Overall, I believe MaestroMotif presents an effective approach to incorporating human prior knowledge to solve specific tasks. Given that NetHack tasks involve a variety of challenges, including understanding natural language descriptions, planning over high-level abstractions, and exploration, it is notable that the authors demonstrated improved performance in the NetHack environment using their method. Additionally, I find it interesting that the authors showed that learning skills simultaneou
- **Source of Performance Gains** I wonder whether the performance gain primarily comes from the human domain knowledge or the method itself. It appears that the method relies heavily on human effort and domain knowledge to solve individual tasks. For instance, humans select a set of skills such as *Discoverer, Descender, Ascender, Merchant,* and *Worshipper,* and they are even required to modify prompts for acquiring each of these skills. Therefore, when comparing this work with baselines, it’s
* MaestroMotif fits in well with the presented prior work and literature, and is a logical extension of much of the "code-as-policies" literature from LLMs, combining human demonstrations and data with AI planning and code-synthesis.
* The baselines used (LLM as policy, PPO, etc.) are sensible baselines and suitably cover the different methods that one might employ to attempt to learn arbitrary NetHack policies for zero-shot transfer to new skills or tasks.
* The paper presents an approach that
* Learning the options networks seems to require intense manual effort, as a human must label different states with preferences/task-alignment to learn different options.
* While MaestroMotif outperforms the baselines for arbitrary tasks, it does seem to be pitted against methods that have no real way of generalizing to such tasks (for example, are the score-maximization approaches capable of receiving sub-goals or goals in any way?). The success of MaestroMotif is noteworthy, but the delta over
- The experiments are comprehensive, covering most baselines one would expect and extensive ablations (like comparing skills learned in isolation vs simultaneously, LLM scale, and the architecture). Furthermore, the results clearly demonstrate the effectiveness of MaestroMotif.
- To me, the results of this paper mark a big improvement in the domain of hierarchical RL, where LLM-generated code is used to conduct high-level planning while a small RL network controls low-level actions.
- The propos
- The technique only trains reward functions on an existing offline dataset, which may be out of distribution with respect to the distribution of states encountered by the trained RL agent. This choice is borrowed from the original Motif paper (which said that the reward model was not fine-tuned during RL training due to simplicity), but based on the poor performance of the RL baselines (used to collect the original dataset) it seems like there may be a greater distribution shift in this work. R
Taxonomy
Topics: AI and HR Technologies
