Enhancing PPO with Trajectory-Aware Hybrid Policies
Qisai Liu, Zhanhong Jiang, Hsin-Jung Yang, Mahsa Khosravi, Joshua R. Waite, Soumik Sarkar

TL;DR
This paper introduces HP3O, a hybrid policy optimization method that uses a trajectory replay buffer to improve sample efficiency and reduce variance in PPO, with theoretical guarantees and empirical validation in continuous control tasks.
Contribution
The paper proposes HP3O, a novel hybrid policy optimization algorithm that enhances PPO's performance by incorporating a trajectory replay buffer managed with a first-in, first-out (FIFO) strategy.
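The FIFO trajectory buffer can be illustrated with a minimal Python sketch, assuming whole trajectories are stored as units and the buffer has a fixed capacity. The names `Trajectory`, `TrajectoryBuffer`, and `capacity` are hypothetical, and scoring a trajectory by its undiscounted sum of rewards is an assumption; this is not the authors' implementation.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List


@dataclass
class Trajectory:
    """One trajectory collected by a recent policy (hypothetical representation)."""
    states: List[list] = field(default_factory=list)
    actions: List[list] = field(default_factory=list)
    rewards: List[float] = field(default_factory=list)

    @property
    def ret(self) -> float:
        # Assumption: the return is the undiscounted sum of rewards.
        return sum(self.rewards)


class TrajectoryBuffer:
    """Hypothetical FIFO trajectory buffer: once full, the oldest trajectory is evicted."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        # A deque with maxlen drops its leftmost (oldest) item when appending to a full deque.
        self._trajs = deque(maxlen=capacity)

    def add(self, traj: Trajectory) -> None:
        self._trajs.append(traj)

    def __len__(self) -> int:
        return len(self._trajs)

    def trajectories(self) -> List[Trajectory]:
        # Oldest first, newest last.
        return list(self._trajs)


# Usage: insert five trajectories into a buffer that holds three.
buf = TrajectoryBuffer(capacity=3)
for i in range(5):
    buf.add(Trajectory(rewards=[float(i)]))
print([t.ret for t in buf.trajectories()])  # [2.0, 3.0, 4.0]
```

Using `collections.deque` with `maxlen` gives FIFO eviction directly: only the most recent trajectories survive, which is how the buffer limits data distribution drift.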
Findings
HP3O empirically reduces variance.
HP3O improves sample efficiency over baseline algorithms.
Theoretical policy improvement guarantees are established.
Abstract
Proximal policy optimization (PPO) is one of the most popular state-of-the-art on-policy algorithms; it has become a standard baseline in modern reinforcement learning, with applications in numerous fields. Though it delivers stable performance with theoretical policy improvement guarantees, high variance and high sample complexity remain critical challenges for on-policy algorithms. To alleviate these issues, we propose Hybrid-Policy Proximal Policy Optimization (HP3O), which utilizes a trajectory replay buffer to make efficient use of trajectories generated by recent policies. In particular, the buffer applies the "first in, first out" (FIFO) strategy so as to keep only the recent trajectories and attenuate the data distribution drift. A batch consisting of the trajectory with the best return and other randomly sampled ones from the buffer is used for updating the policy networks.…
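The batch construction described in the abstract (the best-return trajectory plus randomly sampled others) can be sketched as below. This is a minimal illustration under stated assumptions: each stored trajectory is represented only by its reward sequence, returns are undiscounted sums, and the remaining slots are filled uniformly at random without replacement. The function names `trajectory_return` and `sample_hybrid_batch` are hypothetical.

```python
import random
from typing import List, Sequence


def trajectory_return(rewards: Sequence[float]) -> float:
    # Assumption: trajectories are ranked by their undiscounted sum of rewards.
    return float(sum(rewards))


def sample_hybrid_batch(
    buffer: Sequence[Sequence[float]], batch_size: int, rng: random.Random
) -> List[int]:
    """Return buffer indices: the best-return trajectory first, then up to
    (batch_size - 1) other trajectories chosen at random.

    `buffer` holds one reward sequence per stored trajectory (hypothetical layout).
    """
    if not buffer:
        raise ValueError("buffer is empty")
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    returns = [trajectory_return(r) for r in buffer]
    # The highest-return trajectory is always included.
    best = max(range(len(buffer)), key=returns.__getitem__)
    others = [i for i in range(len(buffer)) if i != best]
    k = min(batch_size - 1, len(others))
    # Assumption: remaining slots are sampled uniformly without replacement.
    return [best] + rng.sample(others, k)


# Usage: returns are 2.0, 5.0, 1.5 and 4.0, so index 1 always leads the batch.
buffer = [[1.0, 1.0], [5.0], [0.5, 0.5, 0.5], [2.0, 2.0]]
batch = sample_hybrid_batch(buffer, batch_size=3, rng=random.Random(0))
print(batch[0])  # 1
```

Always including the best trajectory guarantees that high-return experience appears in every update, while the random picks keep the rest of the batch drawn from the recent trajectories held in the buffer.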
Taxonomy
Topics: Transportation and Mobility Innovations · Mobile Agent-Based Network Management · Optimization and Search Problems
