Dynamic Learning Rate for Deep Reinforcement Learning: A Bandit Approach
Henrique Donâncio, Antoine Barrier, Leah F. South, Florence Forbes

TL;DR
This paper introduces LRRL, a meta-learning method that dynamically adjusts the learning rate in deep reinforcement learning based on policy performance, improving stability and results over traditional schedulers.
Contribution
The paper presents LRRL, a novel meta-learning approach for adaptive learning rate selection in deep RL that is competitive with or superior to standard decay schedulers and tuned fixed rates on Atari and MuJoCo benchmarks.
Findings
LRRL achieves competitive or superior performance on Atari and MuJoCo benchmarks.
LRRL remains robust even with candidate rates that cause divergence.
Dynamic adjustment improves training stability and efficiency.
Abstract
In deep Reinforcement Learning (RL), the learning rate critically influences both stability and performance, yet its optimal value shifts during training as the environment and policy evolve. Standard decay schedulers assume monotonic convergence and often misalign with these dynamics, leading to premature or delayed adjustments. We introduce LRRL, a meta-learning approach that dynamically selects the learning rate based on policy performance rather than training steps. LRRL adaptively favors rates that improve returns, remaining robust even when the candidate set includes values that individually cause divergence. Across Atari and MuJoCo benchmarks, LRRL achieves performance competitive with or superior to tuned baselines and standard schedulers. Our findings position LRRL as a practical solution for adapting to non-stationary objectives in deep RL.
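The idea of selecting a learning rate from a finite candidate set based on observed performance can be illustrated with a bandit. One of the reviews below states that LRRL is based on Exp3, so the following is a minimal sketch of textbook Exp3 choosing among candidate learning rates. It is not the authors' implementation: the candidate values, the exploration rate `gamma`, and the `toy_feedback` function are all hypothetical, and the paper's actual reward definition is not reproduced here. The sketch only assumes that feedback has been scaled to [0, 1].

```python
import math
import random


class Exp3:
    """Textbook Exp3 over a finite set of arms (here: candidate learning rates)."""

    def __init__(self, n_arms, gamma=0.1, seed=None):
        self.n_arms = n_arms
        self.gamma = gamma                # exploration rate in (0, 1] (assumed value)
        self.weights = [1.0] * n_arms     # one weight per arm
        self.rng = random.Random(seed)

    def probabilities(self):
        # Mix the weight-proportional distribution with uniform exploration.
        total = sum(self.weights)
        return [(1.0 - self.gamma) * w / total + self.gamma / self.n_arms
                for w in self.weights]

    def select(self):
        # Sample one arm index according to the current probabilities.
        return self.rng.choices(range(self.n_arms),
                                weights=self.probabilities(), k=1)[0]

    def update(self, arm, reward):
        # `reward` must lie in [0, 1]. Only the pulled arm is updated,
        # using an importance-weighted reward estimate.
        p = self.probabilities()[arm]
        estimate = reward / p
        self.weights[arm] *= math.exp(self.gamma * estimate / self.n_arms)
        # Rescale so the largest weight is 1; this avoids overflow and
        # leaves the probabilities unchanged.
        top = max(self.weights)
        self.weights = [w / top for w in self.weights]


# Hypothetical candidate set; not taken from the paper.
candidate_lrs = [1e-2, 1e-3, 1e-4]


def toy_feedback(lr):
    # Stand-in for "how much did returns improve while training with lr",
    # already scaled to [0, 1]. Purely illustrative.
    return 0.9 if lr == 1e-3 else 0.1


bandit = Exp3(n_arms=len(candidate_lrs), gamma=0.1, seed=0)
for _ in range(500):
    arm = bandit.select()
    lr = candidate_lrs[arm]
    # In a real agent, a block of gradient updates with `lr` would run here,
    # followed by measuring the policy's returns.
    bandit.update(arm, toy_feedback(lr))

print([round(p, 3) for p in bandit.probabilities()])
```

In this toy setting the bandit shifts probability mass toward the candidate with the highest feedback, while the uniform exploration term keeps every candidate selectable. That is one way to picture how a rate that performs poorly on its own can remain in the candidate set without dominating training.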
Peer Reviews
Decision: Submitted to ICLR 2026
* The paper tests two codebases, two environment suites (Atari and MuJoCo), three base RL algorithms (DQN, IQN, and PPO), and three base optimizers (SGD, Adam, and RMSprop)
* Also tests six stationary, non-convex optimization problems
* To my limited knowledge, the approach is somewhat novel

* I think it overclaims the practical benefits. For example, "LRRL ... [reduces] tuning effort while remaining competitive or superior to the best fixed choice", but I think the empirical evidence for this is not that convincing. I think this overclaiming is a big weakness. For example, Fig. 8 shows the Exp3 learning rate is sensitive (which the paper itself notes).
* Few seeds (5-10)
* Adds 4 hyperparameters (alpha, delta, j, kappa)
* I am pretty sure "half of a standard deviation" is a nonstan
The proposed algorithm boosts the performance of a baseline RL algorithm without a significant computational overhead. Since the proposed algorithm is based on Exp3, an algorithm for a finite arm bandit, the computational overhead is significantly low compared to other meta-gradient methods.
- The goal is unclear.
- A heuristic definition of reward (Line 206) for learning rate tuning is not theoretically motivated well.
- Provided empirical results do not seem to be convincing.

# The Goal is Unclear
The research question asked in Introduction is the following: Can we adapt the learning rate dynamically based on the agent's performance, rather than relying on training progress or gradient-based heuristics? There would be many possible ways to adapt the learning rate dynamically ba

- In summary, this paper investigates the hyperparameter optimization (HPO) problem in deep RL. This is a promising area, and I really appreciate it. However, the whole paper focuses on optimizing the learning rate solely, which significantly restricts its landscape and scalability. Moreover, it seems that LRRL can only handle categorical HP values rather than continuous HP space.
- Bandit-based HPO approaches have been widely studied, such as Hyperband [1] and ULTHO [2]. Could the authors clar
See the comments above.
Taxonomy
Topics: Reinforcement Learning in Robotics
