Improving Policy Exploitation in Online Reinforcement Learning with Instant Retrospect Action
Gong Gao, Weidong Zhao, Xianhui Liu, Ning Jia

TL;DR
This paper introduces the Instant Retrospect Action (IRA) algorithm, which enhances policy exploitation in online reinforcement learning by improving representation learning, policy constraints, and update frequency, leading to better efficiency and performance.
Contribution
The paper proposes IRA, which combines the RDE, GAG, and IPU mechanisms to accelerate policy updates and improve learning in online RL, addressing ineffective exploration and overestimation bias.
Findings
IRA significantly improves learning efficiency on MuJoCo tasks.
IRA achieves higher final performance compared to baseline algorithms.
Early-stage conservatism reduces overestimation bias.
Abstract
Existing value-based online reinforcement learning (RL) algorithms suffer from slow policy exploitation due to ineffective exploration and delayed policy updates. To address these challenges, we propose an algorithm called Instant Retrospect Action (IRA). Specifically, we propose Q-Representation Discrepancy Evolution (RDE) to facilitate Q-network representation learning, enabling discriminative representations for neighboring state-action pairs. In addition, we adopt an explicit approach to policy constraints by enabling Greedy Action Guidance (GAG). This is achieved by backtracking historical actions, which effectively enhances the policy update process. Our proposed method relies on providing the learning algorithm with accurate k-nearest-neighbor action value estimates and learning to design a fast-adaptable policy through policy constraints. We further propose the Instant…
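The abstract describes GAG only at a high level, so the following is a minimal sketch of one way the idea could look, not the authors' implementation. It assumes that "backtracking historical actions" means searching a buffer of previously taken actions, that neighbors are found by Euclidean distance, and that the policy constraint is a squared-distance penalty toward the best neighbor. The names greedy_neighbor_action, constraint_penalty, action_buffer, and q_fn are hypothetical.

```python
import numpy as np


def greedy_neighbor_action(policy_action, action_buffer, q_fn, state, k=5):
    """Pick the highest-valued action among the k stored actions
    closest to policy_action.

    Assumptions (not from the paper): Euclidean distance defines the
    neighborhood, and q_fn(state, action) returns a scalar value estimate.
    """
    # Distance from the current policy action to every stored action.
    dists = np.linalg.norm(action_buffer - policy_action, axis=1)
    # Never ask for more neighbors than the buffer holds.
    k = min(k, len(action_buffer))
    nearest = np.argsort(dists)[:k]
    # Evaluate each neighbor and keep the one with the largest estimate.
    q_values = np.array([q_fn(state, action_buffer[i]) for i in nearest])
    return action_buffer[nearest[np.argmax(q_values)]]


def constraint_penalty(policy_action, greedy_action):
    """Squared-distance penalty pulling the policy toward the greedy
    neighbor (an assumed form of the policy constraint)."""
    return float(np.sum((policy_action - greedy_action) ** 2))


if __name__ == "__main__":
    # Toy value function: actions closer to the origin are better.
    toy_q = lambda state, action: -float(np.sum(action ** 2))
    buffer = np.array([[0.0, 0.0], [1.0, 1.0], [0.9, 0.9], [5.0, 5.0]])
    current = np.array([1.0, 1.0])
    guide = greedy_neighbor_action(current, buffer, toy_q, state=None, k=2)
    print(guide, constraint_penalty(current, guide))
```

In this toy setup the two nearest stored actions are [1.0, 1.0] and [0.9, 0.9], and the second has the higher toy value, so it becomes the guidance target. In a real agent the penalty would be added to the actor loss with some weight, which is a design choice this sketch leaves open.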
Taxonomy
Topics: Reinforcement Learning in Robotics · Adaptive Dynamic Programming Control · Age of Information Optimization
