LGR2: Language Guided Reward Relabeling for Accelerating Hierarchical Reinforcement Learning

Utsav Singh; Pramit Bhattacharyya; Vinay P. Namboodiri

arXiv:2406.05881·cs.LG·August 28, 2025

TL;DR

LGR2 introduces a novel hierarchical reinforcement learning framework that uses language models to generate reward functions, significantly improving stability and efficiency in robotic tasks with sparse rewards and long horizons.

Contribution

LGR2 leverages large language models for reward relabeling in HRL, addressing non-stationarity and enhancing sample efficiency in robotic learning tasks.

Findings

1. Achieves over 55% success rate on complex tasks
2. Demonstrates robust transfer to real robots
3. Outperforms existing hierarchical and non-hierarchical methods

Abstract

Large language models (LLMs) have shown remarkable abilities in logical reasoning, in-context learning, and code generation. However, translating natural language instructions into effective robotic control policies remains a significant challenge, especially for tasks requiring long-horizon planning and operating under sparse reward conditions. Hierarchical Reinforcement Learning (HRL) provides a natural framework to address this challenge in robotics; however, it typically suffers from non-stationarity caused by the changing behavior of the lower-level policy during training, destabilizing higher-level policy learning. We introduce LGR2, a novel HRL framework that leverages LLMs to generate language-guided reward functions for the higher-level policy. By decoupling high-level reward generation from low-level policy changes, LGR2 fundamentally mitigates the non-stationarity problem in…

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01·Rating 4·Confidence 4

Strengths

- **Novel mechanism for reward shaping:** The paper identifies a key bottleneck in HRL as reward relabeling due to non-stationarity and provides a new solution that integrates language-based semantics into the process.
- **Improved sample efficiency:** Empirical results show that language-informed reward signals lead to faster policy convergence and more stable subgoal learning.
- **Strong empirical evidence:** Evaluations across three diverse domains with ablation studies isolating the effe

Weaknesses

- **Limited novelty relative to prior LLM-guided RL works:** The translator is similar to that of **L2R**, which also integrates language into reward or goal inference. The paper could articulate more clearly how LGR2 surpasses these beyond implementation differences. L2R and Text2Reward are not mentioned in the related work.
- **Dependence on language accuracy:** Relabeling quality is tied to the correctness and consistency of LLM-generated feedback; noisy descriptions can distort reward signals.
- **Spars

Reviewer 02·Rating 4·Confidence 5

Strengths

The paper tackles the problem of non-stationarity in off-policy HRL, which destabilizes training and impedes convergence. The proposed solution is clear in its application: using an LLM to generate a static reward function $r_{\phi}$ that depends only on state and goals, not the low-level policy's behavior. This decoupling approach is a sound strategy for mitigating reward drift. The experimental validation covers four simulated robotic environments and real-world transfer, and the inclusion of

Weaknesses

Despite its strengths, the paper has several weaknesses that limit its impact:

1. **Limited Novelty:** The primary concern is originality. The LGR2 framework appears to be a straightforward combination of two existing, major frameworks: standard HRL and the Language-to-Reward (L2R) pipeline proposed by Yu et al. (2023). The paper’s core idea is to apply L2R's reward generation mechanism to the high-level policy in an HRL setup. While this combination is logical, the paper would be stronger

Reviewer 03·Rating 4·Confidence 4

Strengths

- The problem (high-level non-stationarity) is clearly framed, with a straightforward idea: use LLM-generated reward functions to relabel the high-level buffer.
- Experiments/ablations are extensive (success curves, HRL+HER comparison, and a non-stationarity metric), with 5-seed reporting.
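The relabeling idea summarized in the first bullet can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `HighLevelTransition`, `example_reward`, and `relabel_buffer` are hypothetical names, and the reward formula is a hand-written stand-in for whatever function an LLM would actually generate. The only property it is meant to show is that the relabeled reward is computed from the state and goals alone.

```python
import math
from dataclasses import dataclass, replace
from typing import Callable, List, Sequence

Vector = Sequence[float]


@dataclass(frozen=True)
class HighLevelTransition:
    """Hypothetical high-level replay entry (field names are assumptions)."""
    state: Vector       # s_t
    final_goal: Vector  # g*
    subgoal: Vector     # g_t, proposed by the higher-level policy
    reward: float       # reward stored at collection time


def distance(a: Vector, b: Vector) -> float:
    """Euclidean distance between two equally sized vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def example_reward(state: Vector, final_goal: Vector, subgoal: Vector) -> float:
    """Stand-in for an LLM-generated reward; the formula is an assumption.

    It reads only the state and the goals, never the lower-level policy.
    """
    return -distance(subgoal, final_goal) - 0.1 * distance(state, subgoal)


def relabel_buffer(
    buffer: List[HighLevelTransition],
    reward_fn: Callable[[Vector, Vector, Vector], float],
) -> List[HighLevelTransition]:
    """Return a copy of the buffer with every reward recomputed by reward_fn."""
    return [
        replace(t, reward=reward_fn(t.state, t.final_goal, t.subgoal))
        for t in buffer
    ]


if __name__ == "__main__":
    old = [HighLevelTransition((0.0, 0.0), (3.0, 4.0), (3.0, 4.0), reward=0.0)]
    for t in relabel_buffer(old, example_reward):
        print(t)
```

Because `reward_fn` takes no policy argument, relabeling the same buffer twice gives identical rewards, which is the sense in which such a reward function is static.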

Weaknesses

- Real-robot section / appendix cross-reference. The paper asks “Can LGR2 policies transfer to real-world robots?” and refers to “supplementary Section 9,” but the appendix appears to run A.1–A.7 with no Section 9/A.9 present after careful checking. The real-world description in the main text is also brief, which limits verification. Overall, the presentation should be further checked / improved.
- Non-stationarity metric detail. The metric is important (avg. distance between higher-level subgoa

Reviewer 04·Rating 2·Confidence 4

Strengths

1.1 Motivation is clear and strong: the authors target sparse tasks and the HRL non-stationarity problem directly.
1.2 The experiments are strong with multiple seeds, good baseline coverage, and consistent gains.
1.3 Sim-to-real results are also nice to see (with fairly good results).

Weaknesses

2.1 The claim of invariance appears insufficiently justified. Although r_φ is defined over (s_t, g*, g_t), the state s_t itself depends on the evolving low-level policy, implying an indirect dependence on π_L. In effect, the method transforms a sparse high-level signal into a denser, language-guided shaping signal, which likely enhances learning stability but does not constitute true invariance.
2.2 The paper references a sim-to-real video in the supplementary, but no supplementary material was
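Point 2.1 can be written out in one line. The symbols $J_H$ (the higher-level objective), $\pi_H$ (the higher-level policy), $\gamma$ (a discount factor), and $d^{\pi_H,\pi_L}$ (the state distribution induced by running both policies) are introduced here for illustration and are not taken from the paper; this is a sketch of the reviewer's argument, not a result from the submission.

$$
J_H(\pi_H) \;=\; \mathbb{E}_{s_t \sim d^{\pi_H,\pi_L}}\!\left[\sum_{t} \gamma^{t}\, r_{\phi}(s_t, g^{*}, g_t)\right]
$$

The function $r_{\phi}$ has no $\pi_L$ argument, so the reward assigned to any given triple $(s_t, g^{*}, g_t)$ never changes. The expectation, however, is taken over states reached under $\pi_L$, so the rewards the higher-level policy actually collects can still shift as $\pi_L$ is updated.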

Reviewer 05·Rating 0·Confidence 4

Strengths

1. Thorough related works section.

Weaknesses

1. Missing description of what subgoals are for all environments. I found some information in the Appendix A.4.1 and A.4.2 but it is missing for Franka kitchen.
2. Tasks are too simple. Maybe tasks like OGBench, or tasks that are more long-horizon, would be better suited for this HRL setup.
3. Results are not convincing, due to lack of baselines and more complex tasks.
4. Writing is a bit confusing at times, and could be more clearly organized. For instance, the experiments section is a bit conf

Taxonomy

Topics·Reinforcement Learning in Robotics