TL;DR
This paper shows that on partially observable continuous-control tasks, PPO is more robust than TD3 and SAC, an advantage the authors attribute to multi-step bootstrapping. Adding multi-step targets to TD3 and SAC improves their robustness as well.
Contribution
It reveals the robustness advantage of PPO under partial observability and shows how multi-step targets enhance TD3 and SAC performance in such settings.
Findings
PPO outperforms TD3 and SAC in POMDPs.
The authors attribute PPO's robustness to the stabilizing effect of multi-step bootstrapping.
Multi-step targets improve TD3 and SAC robustness.
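As a rough illustration of what a multi-step target is, the sketch below computes a generic n-step bootstrapped return: the discounted sum of the next n rewards plus a discounted value estimate at the state reached after n steps. This is a textbook form, not the paper's exact MTD3 or multi-step SAC formulation, which is not reproduced on this page. The function name `n_step_target` and its parameters are hypothetical. In a TD3-style setup, `bootstrap_value` would typically be the minimum of the two target critics, and in a SAC-style setup it would also include an entropy term. Both are assumptions here.

```python
from typing import Optional, Sequence


def n_step_target(
    rewards: Sequence[float],
    bootstrap_value: float,
    gamma: float = 0.99,
    dones: Optional[Sequence[bool]] = None,
) -> float:
    """Generic n-step target (illustrative sketch, not the paper's code).

    Computes sum_{k=0}^{n-1} gamma^k * r_{t+k} + gamma^n * bootstrap_value,
    where n = len(rewards). If the episode terminates at step k, the sum
    stops there and no bootstrap value is added.
    """
    target = 0.0
    discount = 1.0
    for k, reward in enumerate(rewards):
        target += discount * reward
        discount *= gamma
        if dones is not None and dones[k]:
            # Terminal transition: nothing to bootstrap from.
            return target
    return target + discount * bootstrap_value


# With a single reward this reduces to the usual one-step TD target.
one_step = n_step_target([1.0], bootstrap_value=10.0, gamma=0.5)
three_step = n_step_target([1.0, 1.0, 1.0], bootstrap_value=10.0, gamma=0.5)
```

With n = 1 the function yields the standard one-step target used by default in TD3 and SAC. Larger n relies more on observed rewards and less on the critic's estimate at the bootstrapped state.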
Abstract
Deep Reinforcement Learning (DRL) has made considerable advances in simulated and physical robot control tasks, especially when problems admit a fully observed Markov Decision Process (MDP) formulation. When observations only partially capture the underlying state, the problem becomes a Partially Observable MDP (POMDP), and performance rankings between algorithms can change. We empirically compare Proximal Policy Optimization (PPO), Twin Delayed Deep Deterministic Policy Gradient (TD3), and Soft Actor-Critic (SAC) on representative POMDP variants of continuous-control benchmarks. Contrary to widely reported MDP results where TD3 and SAC typically outperform PPO, we observe an inversion: PPO attains higher robustness under partial observability. We attribute this to the stabilizing effect of multi-step bootstrapping. Furthermore, incorporating multi-step targets into TD3 (MTD3) and SAC…
Taxonomy
Methods: Convolution · Target Policy Smoothing · Clipped Double Q-learning · Average Pooling · Dilated Convolution · 1x1 Convolution · Entropy Regularization · Adam · Global Average Pooling
