Improving Policy Exploitation in Online Reinforcement Learning with Instant Retrospect Action

Gong Gao; Weidong Zhao; Xianhui Liu; Ning Jia

arXiv:2601.19720 · cs.LG · February 18, 2026

TL;DR

This paper introduces the Instant Retrospect Action (IRA) algorithm, which enhances policy exploitation in online reinforcement learning by improving Q-network representation learning, constraining the policy explicitly, and updating the policy more frequently, leading to better learning efficiency and final performance.

Contribution

The paper proposes IRA, which combines three mechanisms (RDE, GAG, and IPU) to accelerate policy updates and improve learning in online RL, addressing ineffective exploration and overestimation bias.

Findings

01 IRA significantly improves learning efficiency on MuJoCo tasks.

02 IRA achieves higher final performance than the baseline algorithms.

03 Early-stage conservatism reduces overestimation bias.

Abstract

Existing value-based online reinforcement learning (RL) algorithms suffer from slow policy exploitation due to ineffective exploration and delayed policy updates. To address these challenges, we propose an algorithm called Instant Retrospect Action (IRA). Specifically, we propose Q-Representation Discrepancy Evolution (RDE) to facilitate Q-network representation learning, enabling discriminative representations for neighboring state-action pairs. In addition, we adopt an explicit approach to policy constraints through Greedy Action Guidance (GAG). This is achieved by backtracking over historical actions, which effectively enhances the policy update process. Our proposed method relies on providing the learning algorithm with accurate $k$-nearest-neighbor action value estimates and on learning a fast-adaptable policy through policy constraints. We further propose the Instant…
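
The abstract describes Greedy Action Guidance only at a high level, so the following is a minimal sketch of the general idea rather than the authors' implementation. It assumes that historical actions are kept in a plain array, that neighbors are found by Euclidean distance in action space, and that the constraint is a squared-error penalty pulling the policy action toward the best-valued neighbor. The names `knn_greedy_action`, `guidance_penalty`, and `toy_q` are hypothetical and do not come from the paper.

```python
import numpy as np


def knn_greedy_action(policy_action, action_buffer, q_fn, state, k=3):
    """Among the k buffered actions closest to policy_action (Euclidean
    distance), return the one with the highest Q-value estimate.

    Assumption: this stands in for "backtracking historical actions";
    the paper's actual retrieval rule may differ.
    """
    actions = np.asarray(action_buffer, dtype=float)
    dists = np.linalg.norm(actions - policy_action, axis=1)
    k = min(k, len(actions))
    nearest = np.argsort(dists)[:k]  # indices of the k nearest actions
    q_values = np.array([q_fn(state, actions[i]) for i in nearest])
    return actions[nearest[np.argmax(q_values)]]


def guidance_penalty(policy_action, greedy_action):
    """Squared-error penalty between the policy action and the greedy
    neighbor (an assumed form of the explicit policy constraint)."""
    return float(np.sum((policy_action - greedy_action) ** 2))


def toy_q(state, action):
    """Hypothetical Q-function that peaks at action = 0.5."""
    return -float(np.sum((action - 0.5) ** 2))


state = np.zeros(2)
buffer = np.array([[0.0], [0.2], [0.4], [0.9]])  # historical actions
pi_a = np.array([0.1])                           # current policy action

greedy = knn_greedy_action(pi_a, buffer, toy_q, state, k=3)
penalty = guidance_penalty(pi_a, greedy)
print(greedy, penalty)
```

In this toy setting the three nearest buffered actions are 0.0, 0.2, and 0.4, and the one with the highest estimated value is 0.4, so the penalty would pull the policy action from 0.1 toward 0.4. In an actual training loop such a penalty would typically be added to the actor loss with a weighting coefficient, which is also an assumption here.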

Taxonomy

Topics: Reinforcement Learning in Robotics · Adaptive Dynamic Programming Control · Age of Information Optimization