TL;DR
This paper introduces LODGE, a framework that autonomously learns hierarchical domain models for autonomous agents using LLMs and environment grounding, improving accuracy and efficiency without human input.
Contribution
LODGE is a novel, task-agnostic framework that automatically generates and refines hierarchical domain models from LLMs and environment simulations, reducing reliance on human feedback.
Findings
LODGE produces more accurate domain models than existing methods.
LODGE achieves higher task success rates in multiple IPC and robotic domains.
LODGE requires fewer environment interactions and no human feedback.
Abstract
Domain models enable autonomous agents to solve long-horizon tasks by producing interpretable plans. However, in open-world environments, a single general domain model cannot capture the variety of tasks, so agents must generate suitable task-specific models on the fly. Large Language Models (LLMs), with their implicit common knowledge, can generate such domains, but suffer from high error rates that limit their applicability. Hence, related work relies on extensive human feedback or prior knowledge, which undermines autonomous, open-world deployment. In this work, we propose LODGE, a framework for autonomous domain learning from LLMs and environment grounding. LODGE builds on hierarchical abstractions and automated simulations to identify and correct inconsistencies between abstraction layers and between the model and environment. Our framework is task-agnostic, as it generates…
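The idea of correcting inconsistencies between a model and its environment can be illustrated with a toy sketch. This is not LODGE's algorithm: the LLM refinement call is replaced by a small rule-based stub, the environment is a hard-coded simulator, and every name here (`Action`, `refine`, `toy_propose_fix`, the `pick` action and its predicates) is a hypothetical chosen for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Action:
    """A STRIPS-style action: preconditions, add effects, delete effects."""
    name: str
    pre: frozenset
    add: frozenset
    delete: frozenset = frozenset()


def predict(action, state):
    """Model prediction for one step: (applicable?, next state)."""
    if not action.pre <= state:
        return False, state
    return True, (state - action.delete) | action.add


def find_mismatch(action, env_step, probe_states):
    """Return the first probe state where model and environment disagree."""
    for state in probe_states:
        if predict(action, state) != env_step(action.name, state):
            return state
    return None


def refine(action, env_step, probe_states, propose_fix, max_rounds=5):
    """Repeatedly simulate, detect a disagreement, and ask for a repair."""
    for _ in range(max_rounds):
        bad_state = find_mismatch(action, env_step, probe_states)
        if bad_state is None:
            return action
        observed = env_step(action.name, bad_state)
        action = propose_fix(action, bad_state, observed)
    return action


# --- Toy setup (all assumptions, standing in for a simulator and an LLM) ---
TRUE_PICK = Action(
    "pick",
    pre=frozenset({"hand_empty", "on_table"}),
    add=frozenset({"holding"}),
    delete=frozenset({"hand_empty", "on_table"}),
)
# Ordered guesses for a missing precondition; the order mimics a prior.
CANDIDATES = ("on_table", "hand_empty", "holding")


def toy_env_step(name, state):
    """Hard-coded ground-truth environment for the single 'pick' action."""
    return predict(TRUE_PICK, state)


def toy_propose_fix(action, state, observed):
    """Rule-based stand-in for an LLM repair call."""
    succeeded, next_state = observed
    if succeeded:
        # Environment succeeded: read effects off the observed transition
        # and drop any precondition that did not hold in this state.
        return replace(action, pre=action.pre & state,
                       add=next_state - state, delete=state - next_state)
    # Environment refused but the model allowed it: add a missing precondition.
    for pred in CANDIDATES:
        if pred not in state and pred not in action.pre:
            return replace(action, pre=action.pre | {pred})
    return action


PROBES = [frozenset({"hand_empty", "on_table"}),
          frozenset({"hand_empty"}),
          frozenset({"on_table"})]
draft = Action("pick", pre=frozenset({"hand_empty"}), add=frozenset({"holding"}))
refined = refine(draft, toy_env_step, PROBES, toy_propose_fix)
```

In this toy run, the draft action starts with a missing precondition and no delete effects, and the loop repairs both using simulated feedback only. A real system would replace `toy_propose_fix` with a model call and would need a principled way to choose probe states.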
Peer Reviews
Decision·Submitted to ICLR 2026
The paper addresses an important problem that currently cannot be solved by simply calling an LLM. Automatically generating valid symbolic models would allow provably sound planners to be used in many mission-critical settings. The evaluation also shows that the proposed system offers some advantages over some existing alternatives.
The method relies solely on repeated LLM invocations for refinement and model generation (and, for pseudo-labeling, a VLM). The hope is that, given the right feedback, the LLM will find the right models and symbols. However, beyond the limited empirical evidence the authors present, I cannot imagine the approach providing any kind of theoretical guarantee. More importantly, the paper completely overlooks many of the most important a
1. This paper aims to automatically generate planning domains to reduce manual engineering efforts, which is a valuable goal for the field. The proposed framework, which contains hierarchical generation and error recovery, is a reasonable design that can effectively improve the accuracy of the generated domain models.
2. The paper is well organized.
1. The experimental evaluation is limited. Although the FurnitureBench dataset contains multiple assembly tasks, the experiments only report results for the lamp assembly case. To demonstrate the robustness and generalizability of the proposed approach, experiments should be conducted across all available task categories in FurnitureBench. In addition, more tasks, such as those introduced in [1], should be used to further evaluate the method’s general applicability.
2. The paper overlooks severa
1. The paper is well written overall. The messages are conveyed clearly.
2. The proposed hierarchical domain models can be interesting if supported by more thorough experiments.
The three novelties claimed by this paper are fairly weak for the following reasons:
1. Novelty 1 "Domain learning from planning feedback": While this paper claims that only minimal prior knowledge is required, the initial domain and some critical predicates are provided by the user, as shown in Appendix H1. The idea of using model-based environment feedback to optimize predicates is similar to the prior paper [1], which compromises the novelty of this paper.
2. Novelty 2 "Hierarchical domain models
