Policy Optimization with Smooth Guidance Learned from State-Only Demonstrations
Guojian Wang, Faguo Wu, Xiao Zhang, Tianyuan Chen

TL;DR
This paper introduces POSG, a novel reinforcement learning algorithm that uses state-only demonstrations to improve learning efficiency and control performance in sparse-reward environments, reducing reliance on high-quality action data.
Contribution
The paper proposes POSG, an efficient method leveraging state-only demonstrations with a trajectory importance mechanism to guide policy optimization in sparse-reward settings.
Findings
POSG outperforms the baselines in control performance.
POSG converges faster in four benchmark environments.
State-only demonstrations are used effectively for guidance.
Abstract
The sparsity of reward feedback remains a challenging problem in online deep reinforcement learning (DRL). Previous approaches have utilized offline demonstrations to achieve impressive results in multiple hard tasks. However, these approaches place high demands on demonstration quality, and obtaining expert-like actions is often costly and unrealistic. To tackle these problems, we propose a simple and efficient algorithm called Policy Optimization with Smooth Guidance (POSG), which leverages a small set of state-only demonstrations (where expert action information is not included in demonstrations) to indirectly make approximate and feasible long-term credit assignments and facilitate exploration. Specifically, we first design a trajectory-importance evaluation mechanism to determine the quality of the current trajectory against demonstrations. Then, we introduce a guidance reward…
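The two steps named in the abstract (score the current trajectory against the demonstrations, then turn that score into a guidance reward) can be illustrated with a rough sketch. The abstract is truncated and does not specify how either step is computed, so everything below is an assumption made for illustration and not the authors' method: the nearest-neighbor distance, the exponential weighting, the `temperature` parameter, and the function names `trajectory_importance` and `guidance_rewards` are all hypothetical.

```python
import numpy as np


def trajectory_importance(traj_states, demo_states, temperature=1.0):
    """Hypothetical importance score in (0, 1].

    Assumption: a trajectory is "better" when its visited states lie close
    to the demonstration states. The paper's actual measure is not given
    in the abstract.
    """
    traj = np.asarray(traj_states, dtype=float)  # shape (T, state_dim)
    demo = np.asarray(demo_states, dtype=float)  # shape (N, state_dim)
    # Pairwise Euclidean distances between visited and demo states: (T, N).
    dists = np.linalg.norm(traj[:, None, :] - demo[None, :, :], axis=-1)
    # Average distance from each visited state to its nearest demo state.
    mean_nearest = dists.min(axis=1).mean()
    return float(np.exp(-mean_nearest / temperature))


def guidance_rewards(traj_states, demo_states, temperature=1.0):
    """Spread the trajectory-level score evenly over all time steps.

    Assumption: uniform spreading is only one simple way to hand a
    trajectory-level signal to every step in a sparse-reward task.
    """
    traj = np.asarray(traj_states, dtype=float)
    weight = trajectory_importance(traj, demo_states, temperature)
    return np.full(len(traj), weight / len(traj))


# Usage: only states are needed, no expert actions.
demo = np.array([[0.0, 0.0], [1.0, 0.0]])
rollout = np.array([[0.1, 0.0], [0.9, 0.1]])
print(guidance_rewards(rollout, demo))
```

Note that the sketch consumes states only, which matches the state-only demonstration setting: no expert actions are required to compute the per-step signal.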
Taxonomy
Topics: Reinforcement Learning in Robotics · Adversarial Robustness in Machine Learning
Methods: Sparse Evolutionary Training · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
