TOPPO: Rethinking PPO for Multi-Task Reinforcement Learning with Critic Balancing
Yuanpeng Li, Gefei Lin, Annie Qu, Rui Miao

TL;DR
This paper introduces TOPPO, an improved on-policy reinforcement learning method that addresses critic gradient issues in multi-task settings, outperforming existing off-policy algorithms with fewer resources.
Contribution
TOPPO reformulates PPO with Critic Balancing to enhance gradient conditioning and balance learning across tasks, challenging the dominance of off-policy methods in multi-task RL.
Findings
TOPPO outperforms SAC and ARS baselines on the Meta-World+ benchmark.
TOPPO achieves better early and full-budget performance than strong SAC baselines.
Ablation studies validate the effectiveness of TOPPO's modules.
Abstract
Soft Actor-Critic (SAC) and its variants dominate Multi-Task Reinforcement Learning (MTRL) due to their off-policy sample efficiency, while on-policy methods such as Proximal Policy Optimization (PPO) remain underexplored. We diagnose that PPO in MTRL suffers from a previously overlooked issue: critic-side gradient ill-conditioning, which may cause tail tasks to stall while easy tasks dominate the value function's updates. To address this, we propose TOPPO (Tail-Optimized PPO), a reformulation of PPO via Critic Balancing -- a set of modules that improve gradient conditioning and balance learning dynamics across tasks. Unlike prior approaches that rely on modular architectures or large models, TOPPO targets the optimization bottleneck within PPO itself. Empirically, TOPPO achieves stronger mean and tail-task performance than published SAC-family and ARS-family baselines while using…
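To make the critic-side imbalance described in the abstract concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not TOPPO's Critic Balancing modules, which this summary does not specify. It assumes a hypothetical shared linear critic, two synthetic tasks whose value targets differ in scale by 100x, and per-task target standardization as a simple stand-in balancing technique. All names (`per_task_critic_grad_norms`, `normalize_targets_per_task`) are invented for this example.

```python
import numpy as np

def per_task_critic_grad_norms(features, targets, weights):
    """L2 norm of each task's MSE gradient w.r.t. shared linear critic weights."""
    norms = []
    for phi, y in zip(features, targets):
        residual = phi @ weights - y              # value prediction error, shape (n,)
        grad = 2.0 * phi.T @ residual / len(y)    # gradient of the mean squared error
        norms.append(float(np.linalg.norm(grad)))
    return norms

def normalize_targets_per_task(targets, eps=1e-8):
    """Assumed stand-in for balancing: standardize value targets within each task."""
    return [(y - y.mean()) / (y.std() + eps) for y in targets]

rng = np.random.default_rng(0)
dim, n = 4, 256

# Synthetic state features for two tasks (hypothetical data, not Meta-World+).
features = [rng.normal(size=(n, dim)) for _ in range(2)]
true_w = rng.normal(size=dim)

# Task 0 has value targets 100x larger than task 1.
scales = [100.0, 1.0]
targets = [s * (phi @ true_w) for s, phi in zip(scales, features)]

weights = np.zeros(dim)  # shared critic weights at initialization

raw_norms = per_task_critic_grad_norms(features, targets, weights)
balanced_norms = per_task_critic_grad_norms(
    features, normalize_targets_per_task(targets), weights
)

print("raw gradient norms per task:     ", raw_norms)
print("balanced gradient norms per task:", balanced_norms)
```

With raw targets, the large-scale task contributes a gradient roughly two orders of magnitude larger than the other task, so a summed critic loss is driven almost entirely by it. After per-task standardization the two gradient norms are of comparable size. Whether TOPPO uses anything resembling this normalization is not stated in the text above.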
