Careful at Estimation and Bold at Exploration
Xing Chen, Yijun Liu, Zhaogeng Liu, Hechang Chen, Hengshuai Yao, Yi Chang

TL;DR
This paper introduces a novel exploration strategy for deterministic policy reinforcement learning (DPRL) that mitigates aimless exploration and policy divergence by combining a double-Q function framework with a surrogate policy, leading to improved performance on MuJoCo benchmarks.
Contribution
It proposes a new exploration method based on greedy Q softmax updates and surrogate policy learning, addressing key issues in policy-based exploration for DPRL.
Findings
Superior performance on MuJoCo benchmarks.
Effective in complex environments like Humanoid.
Outperforms previous state-of-the-art methods.
Abstract
Exploration strategies in continuous action spaces are often heuristic because the set of actions is infinite, and such methods do not yield general conclusions. Prior work has shown that policy-based exploration is beneficial for continuous action spaces in deterministic policy reinforcement learning (DPRL). However, policy-based exploration in DPRL has two prominent issues, aimless exploration and policy divergence, and the policy gradient used for exploration is only sometimes helpful because of inaccurate estimation. Based on the double-Q function framework, we introduce a novel exploration strategy, separate from the policy gradient, to mitigate these issues. We first propose the greedy Q softmax update schema for the Q value update. The expected Q value is derived as a weighted sum of the conservative Q value over actions, with the corresponding greedy Q value as the weight. Greedy Q…
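The weighted sum described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and the abstract is truncated before the details are given, so several points are assumptions: the "greedy" Q value is taken as the elementwise maximum of the two critics, the "conservative" Q value as the elementwise minimum, the weights as a softmax over the greedy values, and the sum runs over a finite set of sampled actions. The function name `greedy_q_softmax_value` and the `temperature` parameter are hypothetical.

```python
import math


def greedy_q_softmax_value(q1, q2, temperature=1.0):
    """Softmax-weighted value estimate over a finite set of sampled actions.

    q1, q2: Q-values from two critics, evaluated on the same sampled actions.
    Assumption (not confirmed by the source): greedy Q = max(q1, q2),
    conservative Q = min(q1, q2), weights = softmax(greedy Q / temperature).
    """
    if len(q1) != len(q2) or len(q1) == 0:
        raise ValueError("q1 and q2 must be non-empty and of equal length")
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    greedy = [max(a, b) for a, b in zip(q1, q2)]
    conservative = [min(a, b) for a, b in zip(q1, q2)]

    # Numerically stable softmax over the greedy Q-values.
    shift = max(greedy)
    exps = [math.exp((g - shift) / temperature) for g in greedy]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Expected value: conservative Q-values weighted by the greedy softmax.
    value = sum(w * c for w, c in zip(weights, conservative))
    return value, weights


if __name__ == "__main__":
    # Hypothetical critic outputs for three sampled actions.
    value, weights = greedy_q_softmax_value([1.0, 2.0, 3.0], [1.5, 1.0, 2.5])
    print("weights:", weights)
    print("value estimate:", value)
```

Under these assumptions, actions that look best to the optimistic critic receive the most weight, while the value that is actually backed up comes from the pessimistic critic, which is one reading of being "bold at exploration" and "careful at estimation".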
Taxonomy
Topics: Reinforcement Learning in Robotics
Methods: Softmax
