Careful at Estimation and Bold at Exploration

Xing Chen; Yijun Liu; Zhaogeng Liu; Hechang Chen; Hengshuai Yao; Yi Chang

arXiv:2308.11348·cs.LG·August 23, 2023

Open Access

TL;DR

This paper introduces a novel exploration strategy for deterministic policy reinforcement learning (DPRL) that mitigates aimless exploration and policy divergence by combining a double-Q function framework with a surrogate policy, leading to improved performance on MuJoCo benchmarks.

Contribution

The paper proposes a new exploration method based on greedy Q softmax updates and surrogate policy learning, addressing two key issues in policy-based exploration for DPRL: aimless exploration and policy divergence.

Findings

01 Superior performance on MuJoCo benchmarks.

02 Effective in complex environments such as Humanoid.

03 Outperforms previous state-of-the-art methods.

Abstract

Exploration strategies in continuous action spaces are often heuristic because the number of actions is infinite, and such methods cannot yield general conclusions. Prior work has shown that policy-based exploration is beneficial for continuous action spaces in deterministic policy reinforcement learning (DPRL). However, policy-based exploration in DPRL has two prominent issues, aimless exploration and policy divergence, and the policy gradient for exploration is only sometimes helpful because of inaccurate estimation. Based on the double-Q function framework, we introduce a novel exploration strategy, separate from the policy gradient, to mitigate these issues. We first propose the greedy Q softmax update schema for the Q value update. The expected Q value is derived as a weighted sum of the conservative Q value over actions, where each weight is the corresponding greedy Q value. Greedy Q…
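The weighted sum described in the abstract can be sketched roughly as follows. This is only an illustration under assumptions that the text above does not confirm: the conservative Q value is taken as the minimum of the two critics, the greedy Q value as their maximum, the weights are a softmax over the greedy values, and the sum runs over a finite set of sampled actions. The function name `greedy_q_softmax_value` and the inverse temperature `beta` are hypothetical.

```python
import math


def greedy_q_softmax_value(q1, q2, beta=1.0):
    """Softmax-weighted value over a finite set of sampled actions.

    q1, q2: Q estimates from two critics for the same sampled actions.
    beta:   inverse temperature (hypothetical parameter, not from the paper).
    """
    if len(q1) != len(q2) or not q1:
        raise ValueError("q1 and q2 must be non-empty and of equal length")

    # Assumption: conservative = element-wise min, greedy = element-wise max.
    conservative = [min(a, b) for a, b in zip(q1, q2)]
    greedy = [max(a, b) for a, b in zip(q1, q2)]

    # Numerically stable softmax over the greedy values.
    shift = max(greedy)
    exps = [math.exp(beta * (g - shift)) for g in greedy]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Expected value: conservative estimates weighted by greedy-based weights.
    return sum(w * c for w, c in zip(weights, conservative))


if __name__ == "__main__":
    # Two critics scoring three sampled actions.
    print(greedy_q_softmax_value([1.0, 2.0, 0.5], [1.5, 1.8, 0.7], beta=2.0))
```

With `beta=0` the weights are uniform and the result is the plain mean of the conservative values; as `beta` grows, the result approaches the conservative value of the action that the greedy estimate ranks highest.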

Taxonomy

Topics: Reinforcement Learning in Robotics

Methods: Softmax