Accelerating Proximal Policy Optimization Learning Using Task Prediction for Solving Environments with Delayed Rewards
Ahmad Ahmad, Mehdi Kermanshah, Kevin Leahy, Zachary Serlin, Ho Chit Siu, Makai Mann, Cristian-Ioan Vasile, Roberto Tron, Calin Belta

TL;DR
This paper enhances Proximal Policy Optimization for reinforcement learning with delayed rewards by integrating offline expert data and temporal logic-based reward shaping, leading to faster and more effective learning.
Contribution
It introduces a hybrid PPO architecture with reward shaping via Time Window Temporal Logic (TWTL), providing theoretical guarantees and improved empirical performance in delayed-reward environments.
Findings
Faster learning in the inverted pendulum and lunar lander environments.
Improved final performance over standard PPO.
Theoretical proof of performance bounds and reward preservation.
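The reward-preservation finding can be illustrated with classical potential-based shaping, in which the shaped reward is r + gamma * phi(s') - phi(s). This is a minimal sketch under assumptions: whether the paper's TWTL shaping takes exactly this form is not stated in this summary, and the potential values below are hypothetical stand-ins for a "task progress" score. The helper names are illustrative only.

```python
def plain_return(rewards, gamma):
    """Discounted return of the original (unshaped) rewards."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def shaped_return(rewards, potentials, gamma):
    """Discounted return of r_t + gamma * phi(s_{t+1}) - phi(s_t).

    potentials has one more entry than rewards: phi(s_0), ..., phi(s_T).
    """
    total = 0.0
    for t, r in enumerate(rewards):
        shaping = gamma * potentials[t + 1] - potentials[t]
        total += (gamma ** t) * (r + shaping)
    return total


# Hypothetical episode: a single delayed reward at the final step.
rewards = [0.0, 0.0, 0.0, 1.0]
# Hypothetical potentials; the terminal potential is set to zero.
potentials = [0.1, 0.4, 0.7, 0.9, 0.0]
gamma = 0.9

g_plain = plain_return(rewards, gamma)
g_shaped = shaped_return(rewards, potentials, gamma)
```

Because the shaping terms telescope, the shaped return differs from the original return only by gamma^T * phi(s_T) - phi(s_0), which does not depend on the actions taken. Under this assumed form, the ranking of policies is unchanged while intermediate steps still receive a learning signal.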
Abstract
In this paper, we tackle the challenging problem of delayed rewards in reinforcement learning (RL). While Proximal Policy Optimization (PPO) has emerged as a leading Policy Gradient method, its performance can degrade under delayed rewards. We introduce two key enhancements to PPO: a hybrid policy architecture that combines an offline policy (trained on expert demonstrations) with an online PPO policy, and a reward shaping mechanism using Time Window Temporal Logic (TWTL). The hybrid architecture leverages offline data throughout training while maintaining PPO's theoretical guarantees. Building on the monotonic improvement framework of Trust Region Policy Optimization (TRPO), we prove that our approach ensures improvement over both the offline policy and previous iterations, with a bounded performance gap of , where is the mixing…
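One way to picture the hybrid architecture described in the abstract is as a mixture of two action distributions. The sketch below is an assumption-laden illustration for a discrete action space: it treats `alpha` as the weight on the offline policy, which may not match the paper's convention, and the function names `mix_policies` and `sample_action` are hypothetical.

```python
import random


def mix_policies(p_offline, p_online, alpha):
    """Return alpha * p_offline + (1 - alpha) * p_online, elementwise.

    Assumption: alpha is the weight on the offline (expert-trained) policy.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(p_offline) != len(p_online):
        raise ValueError("policies must cover the same action set")
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(p_offline, p_online)]


def sample_action(probs, rng):
    """Draw one action index according to the mixed distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical two-action example: the offline policy always picks action 0,
# the online policy always picks action 1.
mixed = mix_policies([1.0, 0.0], [0.0, 1.0], alpha=0.25)
action = sample_action(mixed, random.Random(0))
```

A convex combination of two valid distributions is itself a valid distribution, so the mixed policy can be sampled from directly. A smaller weight on one component keeps the hybrid policy closer to the other, which is the kind of quantity a mixing-dependent performance bound would track.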
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning · IoT and Edge/Fog Computing · Software Reliability and Analysis Research
Methods: Entropy Regularization · Proximal Policy Optimization · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
