Dynamic Learning Rate for Deep Reinforcement Learning: A Bandit Approach

Henrique Donâncio; Antoine Barrier; Leah F. South; Florence Forbes

arXiv:2410.12598·cs.LG·October 9, 2025

Open Access·3 Reviews

TL;DR

This paper introduces LRRL, a meta-learning method that dynamically adjusts the learning rate in deep reinforcement learning based on policy performance, improving stability and results over traditional schedulers.

Contribution

The paper presents LRRL, a novel meta-learning approach for adaptive learning rate selection in deep RL, outperforming standard decay schedulers and fixed rates.

Findings

01. LRRL achieves competitive or superior performance on Atari and MuJoCo benchmarks.

02. LRRL remains robust even with candidate rates that cause divergence.

03. Dynamic adjustment improves training stability and efficiency.

Abstract

In deep Reinforcement Learning (RL), the learning rate critically influences both stability and performance, yet its optimal value shifts during training as the environment and policy evolve. Standard decay schedulers assume monotonic convergence and often misalign with these dynamics, leading to premature or delayed adjustments. We introduce LRRL, a meta-learning approach that dynamically selects the learning rate based on policy performance rather than training steps. LRRL adaptively favors rates that improve returns, remaining robust even when the candidate set includes values that individually cause divergence. Across Atari and MuJoCo benchmarks, LRRL achieves performance competitive with or superior to tuned baselines and standard schedulers. Our findings position LRRL as a practical solution for adapting to non-stationary objectives in deep RL.
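
The selection mechanism described in the abstract can be illustrated with a small bandit sketch. The reviews below note that LRRL builds on Exp3, so the example treats each candidate learning rate as an arm of a textbook Exp3 bandit. This is a minimal sketch under stated assumptions, not the authors' implementation: the class name `Exp3LearningRateSelector`, the exploration parameter `gamma`, the `[0, 1]` reward scaling, and the toy feedback in `run_toy_demo` are all hypothetical, and the paper's actual reward definition is not reproduced here.

```python
import math
import random


class Exp3LearningRateSelector:
    """Textbook Exp3 over a finite set of candidate learning rates (illustrative only)."""

    def __init__(self, learning_rates, gamma=0.1, seed=0):
        self.learning_rates = list(learning_rates)
        self.gamma = gamma  # exploration rate in (0, 1]; hypothetical default
        self.weights = [1.0] * len(self.learning_rates)
        self.rng = random.Random(seed)

    def probabilities(self):
        # Mix the weight-proportional distribution with uniform exploration.
        total = sum(self.weights)
        k = len(self.weights)
        return [(1 - self.gamma) * w / total + self.gamma / k for w in self.weights]

    def select(self):
        # Sample an arm index and return it with its learning rate.
        probs = self.probabilities()
        arm = self.rng.choices(range(len(probs)), weights=probs, k=1)[0]
        return arm, self.learning_rates[arm]

    def update(self, arm, reward):
        # `reward` is assumed to be pre-scaled to [0, 1], e.g. a normalized
        # change in average return since the last selection (an assumption).
        probs = self.probabilities()
        estimate = reward / probs[arm]  # importance-weighted reward estimate
        self.weights[arm] *= math.exp(self.gamma * estimate / len(self.weights))
        # Rescale by the largest weight; probabilities are unchanged and
        # the weights cannot overflow.
        largest = max(self.weights)
        self.weights = [w / largest for w in self.weights]


def run_toy_demo(rounds=500):
    # Toy stand-in for RL training: only the middle rate "improves returns".
    selector = Exp3LearningRateSelector([1e-2, 1e-3, 1e-4])
    for _ in range(rounds):
        arm, _lr = selector.select()
        reward = 1.0 if arm == 1 else 0.0
        selector.update(arm, reward)
    return selector.probabilities()
```

In a real training loop, `select()` would be called every few policy updates to set the optimizer's learning rate, and `update()` would receive a bounded signal derived from the agent's recent returns. Because the uniform exploration term keeps every arm's probability above `gamma / K`, poorly performing rates are still sampled occasionally, which is the cost of tracking a non-stationary objective.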

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 4·Confidence 2

Strengths

* The paper tests two codebases, two environment suites (Atari and MuJoCo), three base RL algorithms (DQN, IQN, and PPO), and three base optimizers (SGD, Adam, and RMSprop)
* Also tests six stationary, non-convex optimization problems
* To my limited knowledge, the approach is somewhat novel

Weaknesses

* I think it overclaims the practical benefits. For example, "LRRL ... [reduces] tuning effort while remaining competitive or superior to the best fixed choice", but I think the empirical evidence for this is not that convincing. I think this overclaiming is a big weakness. For example, Fig. 8 shows the Exp3 learning rate is sensitive (which the paper itself notes).
* Few seeds (5-10)
* Adds 4 hyperparameters (alpha, delta, j, kappa)
* I am pretty sure "half of a standard deviation" is a nonstan

Reviewer 02·Rating 4·Confidence 3

Strengths

The proposed algorithm boosts the performance of a baseline RL algorithm without a significant computational overhead. Since the proposed algorithm is based on Exp3, an algorithm for a finite arm bandit, the computational overhead is significantly low compared to other meta-gradient methods.

Weaknesses

- The goal is unclear.
- A heuristic definition of reward (Line 206) for learning rate tuning is not theoretically motivated well.
- Provided empirical results do not seem to be convincing.

# The Goal is Unclear

The research question asked in Introduction is the following: "Can we adapt the learning rate dynamically based on the agent’s performance, rather than relying on training progress or gradient-based heuristics?" There would be many possible ways to adapt the learning rate dynamically ba

Reviewer 03·Rating 2·Confidence 4

Strengths

- In summary, this paper investigates the hyperparameter optimization (HPO) problem in deep RL. This is a promising area, and I really appreciate it. However, the whole paper focuses on optimizing the learning rate solely, which significantly restricts its landscape and scalability. Moreover, it seems that LRRL can only handle categorical HP values rather than continuous HP space.
- Bandit-based HPO approaches have been widely studied, such as Hyperband [1] and ULTHO [2]. Could the authors clar

Weaknesses

See the comments above.

Taxonomy

Topics: Reinforcement Learning in Robotics