Multi-Step First: A Lightweight Deep Reinforcement Learning Strategy for Robust Continuous Control with Partial Observability

Lingheng Meng; Rob Gorbet; Michael Burke; Dana Kulić

arXiv:2209.04999 · cs.RO · March 24, 2026

TL;DR

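In partially observable continuous-control tasks, PPO is more robust than TD3 and SAC, reversing the ranking typically reported for fully observed tasks. The paper attributes this to the stabilizing effect of multi-step bootstrapping and shows that adding multi-step targets to TD3 and SAC improves their robustness as well.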

Contribution

The paper shows that PPO is more robust than TD3 and SAC under partial observability, and that multi-step targets improve the performance of TD3 and SAC in the same settings.

Findings

01  PPO outperforms TD3 and SAC in POMDPs.

02  Multi-step bootstrapping stabilizes PPO.

03  Multi-step targets improve TD3 and SAC robustness.
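
The multi-step target mentioned in the findings can be illustrated with a generic n-step return: the discounted sum of the next n rewards, plus a discounted critic estimate at the state reached after n steps. The sketch below follows that textbook definition and is not the authors' implementation; the function name `n_step_target`, its parameters, and the default `gamma` are hypothetical. In TD3 or SAC, `bootstrap_value` would presumably come from the target critics, but how the paper computes it is not shown here.

```python
def n_step_target(rewards, dones, bootstrap_value, gamma=0.99):
    """Generic n-step bootstrapped target (illustrative sketch only).

    rewards:         rewards r_t, ..., r_{t+n-1} along a stored trajectory segment
    dones:           matching flags, True if the episode terminated at that step
    bootstrap_value: critic estimate for the state reached after the n steps
    gamma:           discount factor (assumed value, not taken from the paper)
    """
    target = 0.0
    discount = 1.0
    for reward, done in zip(rewards, dones):
        target += discount * reward
        discount *= gamma
        if done:
            # The episode ended inside the window, so nothing is bootstrapped.
            return target
    # No termination: add the discounted critic estimate after n steps.
    return target + discount * bootstrap_value


# Three-step window without termination, then the same window cut short.
full = n_step_target([1.0, 1.0, 1.0], [False, False, False], 10.0, gamma=0.5)
cut = n_step_target([1.0, 1.0, 1.0], [False, True, False], 10.0, gamma=0.5)
print(full, cut)
```

With a single reward in the window, this reduces to the usual one-step target. Longer windows rely more on observed rewards and less on the critic's estimate, which is the property the findings associate with robustness under partial observability.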

Abstract

Deep Reinforcement Learning (DRL) has made considerable advances in simulated and physical robot control tasks, especially when problems admit a fully observed Markov Decision Process (MDP) formulation. When observations only partially capture the underlying state, the problem becomes a Partially Observable MDP (POMDP), and performance rankings between algorithms can change. We empirically compare Proximal Policy Optimization (PPO), Twin Delayed Deep Deterministic Policy Gradient (TD3), and Soft Actor-Critic (SAC) on representative POMDP variants of continuous-control benchmarks. Contrary to widely reported MDP results where TD3 and SAC typically outperform PPO, we observe an inversion: PPO attains higher robustness under partial observability. We attribute this to the stabilizing effect of multi-step bootstrapping. Furthermore, incorporating multi-step targets into TD3 (MTD3) and SAC…

Code & Models

Repositories

linghengmeng/m_rl_pomdp
PyTorch · Official

Taxonomy

Methods: Convolution · Target Policy Smoothing · Clipped Double Q-learning · Average Pooling · Dilated Convolution · 1x1 Convolution · Entropy Regularization · Adam · Global Average Pooling