Learning Instruction-Following Policies through Open-Ended Instruction Relabeling with Large Language Models

Zhicheng Zhang; Ziyan Wang; Yali Du; Fei Fang

arXiv:2506.20061 · cs.LG · June 26, 2025

TL;DR

This paper introduces a method that uses large language models to automatically relabel agent trajectories with meaningful instructions, improving instruction-following policies in reinforcement learning without extensive human annotations.

Contribution

The paper presents an open-ended instruction relabeling approach that uses LLMs to enrich the training data, so that a single, versatile instruction-following policy can be learned from sparse rewards.

Findings

01. Improved sample efficiency in the Craftax environment
02. Enhanced instruction coverage and policy performance
03. Effective reduction of reliance on human-labeled datasets

Abstract

Developing effective instruction-following policies in reinforcement learning remains challenging due to the reliance on extensive human-labeled instruction datasets and the difficulty of learning from sparse rewards. In this paper, we propose a novel approach that leverages the capabilities of large language models (LLMs) to automatically generate open-ended instructions retrospectively from previously collected agent trajectories. Our core idea is to employ LLMs to relabel unsuccessful trajectories by identifying meaningful subtasks the agent has implicitly accomplished, thereby enriching the agent's training data and substantially alleviating reliance on human annotations. Through this open-ended instruction relabeling, we efficiently learn a unified instruction-following policy capable of handling diverse tasks within a single policy. We empirically evaluate our proposed method in…
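The relabeling idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Step` and `Transition` types, the `achieved` field, the action names, the reward of 1.0 at the completing step, and `stub_proposer` (a stand-in for the LLM call) are all assumptions made for illustration. The sketch takes a trajectory that failed its original instruction, asks a proposer which subtasks were implicitly accomplished, and emits extra training transitions labeled with those instructions.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    observation: str                 # text description of the state (assumed available)
    action: str
    achieved: Tuple[str, ...] = ()   # hypothetical: subtasks completed at this step


@dataclass
class Transition:
    instruction: str
    observation: str
    action: str
    reward: float


# A proposer maps a trajectory to (instruction, completion_step_index) pairs.
Proposer = Callable[[List[Step]], List[Tuple[str, int]]]


def stub_proposer(trajectory: List[Step]) -> List[Tuple[str, int]]:
    """Stand-in for the LLM call: reads subtasks off the hypothetical `achieved` field."""
    proposals = []
    seen = set()
    for index, step in enumerate(trajectory):
        for subtask in step.achieved:
            if subtask not in seen:
                seen.add(subtask)
                proposals.append((subtask, index))
    return proposals


def relabel(
    trajectory: List[Step], original_instruction: str, propose: Proposer
) -> List[Transition]:
    """Turn one trajectory into extra transitions for each proposed instruction."""
    relabeled = []
    for instruction, done_at in propose(trajectory):
        if instruction == original_instruction:
            continue  # the original goal was met here; nothing to relabel
        # Keep the prefix that ends where the subtask was completed.
        for index, step in enumerate(trajectory[: done_at + 1]):
            reward = 1.0 if index == done_at else 0.0  # assumed sparse reward
            relabeled.append(
                Transition(instruction, step.observation, step.action, reward)
            )
    return relabeled


# The agent was asked to craft a pickaxe, failed, but collected wood on the way.
trajectory = [
    Step("near a tree", "move_left"),
    Step("facing a tree", "do", achieved=("collect wood",)),
    Step("holding wood", "noop"),
]
extra = relabel(trajectory, "craft a pickaxe", stub_proposer)
```

In this sketch `extra` holds two transitions labeled "collect wood", with the reward placed on the step that completed the subtask. Replacing `stub_proposer` with a function that prompts an LLM on the textual observations would give the open-ended variant the abstract describes, under whatever reward definition the paper actually uses.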

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 5

Strengths

1: The paper presents a novel paradigm for training open-ended instruction-following agents in a clear and logically coherent way.
2: The experimental results clearly demonstrate that the proposed method outperforms the baselines on the majority of tasks.
3: This method significantly improves the model's generalization capability.

Weaknesses

1: The experimental evaluation is conducted in only one environment. It is recommended to further validate the method in more environments, such as vanilla Minecraft or robotics.
2: The paper lacks a detailed discussion of the observed performance degradation in the experiments. Is the trade-off between this decline in performance and the improvement in semantic representation truly justified?
3: The scalability of the proposed method has not been discussed.
4: Could the authors provide

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- The paper is well written and well motivated.
- The idea of using an LLM to label trajectories in RL is well implemented.
- The experimental results are well presented.

Weaknesses

- Leveraging large language models to label data is a very general idea that has been investigated in various domains. This diminishes the novelty of the paper.
- The benchmark is limited to Craftax. It is therefore hard to assess how well OIR generalizes to other environments (especially larger game environments, such as Minecraft).
- The overall method of OIR looks ad hoc: the prompt, the relabeling of failed trajectories, the reward definition, etc. Hence, it is necessary to test it on

Reviewer 03 · Rating 2 · Confidence 4

Strengths

- The paper is clearly written, well structured, and easy to follow.
- The evaluation explicitly tests generalization to paraphrased and compositional instructions (simple and complex variants), not just the original instruction set.

Weaknesses

- The approach presumes environments that can provide, or be mapped to, textual observations for prompting the LLM. A clearer statement of the environment class (symbolic / text-describable state, discrete action space, sparse achievements) would help readers understand the limits of the contribution.
- Results are reported on only one environment (Craftax-Classic), which limits claims of generality and leaves open whether the gains depend on environment-specific engineering.
- The comparison to only ELLM and PQN

Taxonomy

Topics: Educational Technology and Assessment · Intelligent Tutoring Systems and Adaptive Learning