Gradual Transition from Bellman Optimality Operator to Bellman Operator in Online Reinforcement Learning

Motoki Omura; Kazuki Ota; Takayuki Osa; Yusuke Mukuta; Tatsuya Harada

arXiv:2506.05968 · cs.LG · August 14, 2025

TL;DR

This paper introduces an annealing method that gradually shifts from the Bellman optimality operator to the Bellman operator in actor-critic RL algorithms, improving learning speed and reducing bias in continuous action spaces.

Contribution

The study proposes a novel annealing approach to transition between Bellman operators, enhancing sample efficiency and robustness in continuous action reinforcement learning.
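
For reference, the two operators named above are conventionally defined as follows. The notation (reward r, discount factor \gamma, transition kernel P, policy \pi) is the standard textbook one and is an assumption here, since this page does not state the paper's own symbols.

```latex
\begin{align}
(\mathcal{T}^{*} Q)(s,a) &= r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\Big[\max_{a'} Q(s',a')\Big] \\
(\mathcal{T}^{\pi} Q)(s,a) &= r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a),\; a' \sim \pi(\cdot \mid s')}\big[Q(s',a')\big]
\end{align}
```

The first line is the Bellman optimality operator, whose fixed point is the optimal value function; the second is the Bellman operator for a fixed policy \pi, whose fixed point is that policy's Q-function.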

Findings

01 Modeling optimal values accelerates learning but introduces overestimation bias.

02 The annealing method reduces this overestimation bias.

03 The approach outperforms existing methods in various tasks.

Abstract

For continuous action spaces, actor-critic methods are widely used in online reinforcement learning (RL). However, unlike RL algorithms for discrete actions, which generally model the optimal value function using the Bellman optimality operator, RL algorithms for continuous actions typically model Q-values for the current policy using the Bellman operator. These algorithms for continuous actions rely exclusively on policy updates for improvement, which often results in low sample efficiency. This study examines the effectiveness of incorporating the Bellman optimality operator into actor-critic frameworks. Experiments in a simple environment show that modeling optimal values accelerates learning but leads to overestimation bias. To address this, we propose an annealing approach that gradually transitions from the Bellman optimality operator to the Bellman operator, thereby accelerating…
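
The idea of moving gradually from an optimality-style backup to a policy-evaluation backup can be illustrated with a small sketch. This is not the paper's implementation: the abstract on this page does not say how the transition is realized, so the linear schedule, the convex blend of the two targets, the finite set of candidate actions, and the names `anneal_weight` and `blended_target` are all assumptions made purely for illustration.

```python
import numpy as np


def anneal_weight(step, total_steps):
    """Hypothetical linear schedule: 0.0 at step 0, 1.0 once step >= total_steps."""
    return float(np.clip(step / total_steps, 0.0, 1.0))


def blended_target(reward, gamma, next_q, next_pi, beta):
    """One-step TD target that blends two backups (illustrative assumption).

    next_q  : Q-values of candidate actions in the next state
    next_pi : probabilities the current policy assigns to those actions
    beta    : 0.0 -> optimality-style backup, 1.0 -> policy-evaluation backup
    """
    optimal_backup = np.max(next_q)          # max over actions (Bellman optimality operator)
    policy_backup = np.dot(next_pi, next_q)  # expectation under the policy (Bellman operator)
    return reward + gamma * ((1.0 - beta) * optimal_backup + beta * policy_backup)


# Toy next state with two candidate actions and a uniform policy.
next_q = np.array([1.0, 3.0])
next_pi = np.array([0.5, 0.5])

for step in (0, 500, 1000):
    beta = anneal_weight(step, 1000)
    print(step, beta, blended_target(0.0, 0.9, next_q, next_pi, beta))
```

Early in training the target tracks the best candidate action, and by the end of the schedule it equals the expected value under the current policy, so the target never exceeds the optimality-style backup and shrinks toward the policy's own value as `beta` grows.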

Code & Models

Repositories

motokiomura/annealed-q-learning (JAX, official)

Taxonomy

Topics: Reinforcement Learning in Robotics · Adaptive Dynamic Programming Control · Evolutionary Algorithms and Applications

Methods: Global Average Pooling · 1x1 Convolution · Convolution · Dense Connections · Switchable Atrous Convolution · Clipped Double Q-learning · Experience Replay · Adam · Target Policy Smoothing