Experience-based Knowledge Correction for Robust Planning in Minecraft

Seungjoon Lee; Suhwan Kim; Minhyeon Oh; Youngsik Yoon; Jungseul Ok

arXiv:2505.24157·cs.LG·February 19, 2026

Open Access 3 Reviews

TL;DR

XENON is a Minecraft planning agent that algorithmically corrects its knowledge of item dependencies and actions from experience, enabling robust long-horizon planning even with flawed priors and sparse binary feedback.

Contribution

The paper introduces XENON, a novel approach that algorithmically revises knowledge using experience, enhancing robustness in LLM-based planning in Minecraft.

Findings

1. XENON outperforms prior agents in knowledge learning and planning.

2. XENON surpasses larger models with only a 7B open-weight LLM.

3. The approach effectively corrects item dependencies and actions from limited feedback.

Abstract

Large Language Model (LLM)-based planning has advanced embodied agents in long-horizon environments such as Minecraft, where acquiring latent knowledge of goal (or item) dependencies and feasible actions is critical. However, LLMs often begin with flawed priors and fail to correct them through prompting, even with feedback. We present XENON (eXpErience-based kNOwledge correctioN), an agent that algorithmically revises knowledge from experience, enabling robustness to flawed priors and sparse binary feedback. XENON integrates two mechanisms: Adaptive Dependency Graph, which corrects item dependencies using past successes, and Failure-aware Action Memory, which corrects action knowledge using past failures. Together, these components allow XENON to acquire complex dependencies despite limited guidance. Experiments across multiple Minecraft benchmarks show that XENON outperforms prior…
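The two mechanisms named in the abstract can be illustrated with a toy sketch. This is not the paper's algorithm: the class names, the `max_failures` threshold, and the update rules are all assumptions made for illustration. It only shows the general idea that a success can overwrite a flawed dependency prior, and that repeated failures can remove an action from consideration.

```python
class DependencyGraph:
    """Toy item-dependency store (hypothetical): item -> set of required items."""

    def __init__(self, prior):
        # Copy the (possibly wrong) prior so corrections do not mutate the input.
        self.requires = {item: set(reqs) for item, reqs in prior.items()}

    def record_success(self, item, used_items):
        # Assumption: a success is direct evidence, so it replaces the prior.
        self.requires[item] = set(used_items)


class ActionMemory:
    """Toy failure counter (hypothetical): drops actions that keep failing."""

    def __init__(self, max_failures=2):
        self.max_failures = max_failures  # illustrative threshold, not from the paper
        self.failures = {}

    def record_failure(self, item, action):
        key = (item, action)
        self.failures[key] = self.failures.get(key, 0) + 1

    def candidates(self, item, actions):
        # Keep only actions that have failed fewer than max_failures times.
        return [a for a in actions
                if self.failures.get((item, a), 0) < self.max_failures]


# Flawed prior: sticks are believed to require logs directly.
graph = DependencyGraph({"stick": {"log"}})
graph.record_success("stick", {"planks"})

memory = ActionMemory(max_failures=2)
memory.record_failure("stick", "mine")
memory.record_failure("stick", "mine")
remaining = memory.candidates("stick", ["mine", "craft", "smelt"])
```

After the two recorded failures, `remaining` no longer contains `"mine"`, and the dependency stored for `"stick"` reflects the observed success rather than the prior.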

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

- The paper proposes a novel algorithmic approach for planning that does not require retraining LLMs, offering an efficient solution to improve LLMs’ knowledge representation from experience.
- The ablation study clearly demonstrates that Adaptive Dependency Graph (ADG) and FAM are highly robust to initial errors. They effectively recover the correct dependency graph despite the presence of flawed priors. Moreover, no additional LLMs are required to generate these priors, which greatly enhances

Weaknesses

1. The paper states (Line 126) that knowledge is modeled as a directed acyclic graph. However, it is unclear how the system detects and resolves cyclic structures that may arise during initialization (e.g., when a wooden axe is required to obtain logs, logs produce planks, and planks are used to craft the axe). I suggest clarifying whether cycles are possible in practice and, if so, how they are algorithmically identified and dismantled.
2. In Line 212, the method refers to selecting required it
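The cycle in the reviewer's first point can be made concrete with a generic depth-first check. This is a standard graph technique, not something taken from the paper, and the item names and the `find_cycle` helper are hypothetical. The sketch assumes the dependency knowledge is stored as a mapping from each item to the items it requires.

```python
def find_cycle(requires):
    """Return one cycle as a list of items, or None if the graph is acyclic."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {}
    stack = []

    def visit(node):
        color[node] = GRAY
        stack.append(node)
        for dep in requires.get(node, ()):
            state = color.get(dep, WHITE)
            if state == GRAY:
                # dep is already on the current path, so the path closes a cycle.
                return stack[stack.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found:
                    return found
        stack.pop()
        color[node] = BLACK
        return None

    for node in list(requires):
        if color.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# The reviewer's example: axe needed for logs, logs for planks, planks for the axe.
cyclic_prior = {
    "log": ["wooden_axe"],
    "planks": ["log"],
    "wooden_axe": ["planks"],
}
cycle = find_cycle(cyclic_prior)
```

Detection is only half of the reviewer's question. How a detected cycle should be broken, for example which edge to drop, is a design decision this sketch does not address.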

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. Learning the recipes by interacting with the environment makes sense and I appreciate its integration with the LLM agent.
2. The results show nice improvements over the ablation and baseline.
3. The solution is designed to handle execution delays

Weaknesses

1. The paper is not sufficiently self-contained. In particular, the authors assume the reader knows how DECKARD works (e.g., in line 172). Note that DECKARD was only published on arXiv and is not well known.
2. I am not convinced about the generality of the approach because it relies on hyperparameters that I guess are tuned for this domain.
3. Some details are not clear, and there are quite a few design choices that seem arbitrary to me. See below in the list of questions.

Reviewer 03 · Rating 6 · Confidence 5

Strengths

1. The paper tackles a real failure mode: LLM priors over item dependencies and actions are brittle and hard to self-correct with only binary feedback. Treating the LLM as a planner while moving correction into algorithmic external memory is a crisp, testable design choice.
2. Using Qwen2.5-VL-7B, XENON outperforms or rivals methods using GPT-4/4V on several long-horizon task groups, particularly when oracle dependencies are provided. It still performs strongly with learned dependencies on chal

Weaknesses

1. The proposed method is simple and intuitive, akin to prompting an LLM to correct memory and graphs, lacking technical innovation.
2. The procedure references “similar, successfully obtained items” but the similarity function, features, and retrieval specifics are not fully spelled out in the main text (embedding choice, distance metric, negatives, and sensitivity). This matters because ADG’s replacement set can systematically bias learning if similarity is noisy.
3. The design hinges on $c_

Taxonomy

Topics: Robotic Path Planning Algorithms · AI-based Problem Solving and Planning · Logic, Reasoning, and Knowledge