Policy Gradient with Tree Expansion

Gal Dalal; Assaf Hallak; Gugan Thoppe; Shie Mannor; Gal Chechik

arXiv:2301.13236 · cs.LG · May 27, 2025

TL;DR

This paper introduces SoftTreeMax, a planning-based extension of softmax for policy gradients, which significantly reduces gradient variance and improves sample efficiency in reinforcement learning.

Contribution

It proposes SoftTreeMax with theoretical variance bounds and demonstrates practical benefits using GPU-based tree expansion in Atari games.

Findings

01. Reduces gradient variance by three orders of magnitude.

02. Improves sample complexity over distributed PPO.

03. Provides theoretical bounds on gradient bias and variance.

Abstract

Policy gradient methods are notorious for having a large variance and high sample complexity. To mitigate this, we introduce SoftTreeMax -- a generalization of softmax that employs planning. In SoftTreeMax, we extend the traditional logits with the multi-step discounted cumulative reward, topped with the logits of future states. We analyze SoftTreeMax and explain how tree expansion helps to reduce its gradient variance. We prove that the variance depends on the chosen tree-expansion policy. Specifically, we show that the closer the induced transitions are to being state-independent, the stronger the variance decay. With approximate forward models, we prove that the resulting gradient bias diminishes with the approximation error while retaining the same variance reduction. Ours is the first result to bound the gradient bias for an approximate model. In a practical implementation of…
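The construction described in the abstract (logits extended with a multi-step discounted cumulative reward, topped with the logits of future states) can be illustrated with a small tabular sketch. This is not the authors' implementation: it assumes a known tabular model, a uniform tree-expansion policy, and a cumulative-reward form of the score. The function name `soft_tree_max_policy`, the array layouts `P[a, s, s']` and `r[s, a]`, and the temperature `beta` are hypothetical choices made for illustration only.

```python
import numpy as np

def soft_tree_max_policy(P, r, theta, s, depth, gamma=0.99, beta=1.0):
    """Action probabilities at state s from a depth-`depth` expansion.

    P:     (A, S, S) array, P[a, s, s2] = transition probability (assumed known).
    r:     (S, A) array of rewards.
    theta: (S,) array of learnable logits applied at the leaf states.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    # Assumption: the tree-expansion policy is uniform over actions, so the
    # induced transitions and rewards are plain averages over the action axis.
    P_b = P.mean(axis=0)   # (S, S)
    r_b = r.mean(axis=1)   # (S,)
    # value[s2] = expected discounted reward plus discounted leaf logit when
    # starting from s2 with the remaining expansion steps.
    value = theta.astype(float)
    for _ in range(depth - 1):
        value = r_b + gamma * (P_b @ value)
    # The first step branches on every action available at the root state s.
    scores = r[s] + gamma * (P[:, s, :] @ value)   # (A,)
    z = beta * (scores - scores.max())             # shift for numerical stability
    probs = np.exp(z)
    return probs / probs.sum()

# Toy two-state, two-action model: action 0 stays, action 1 switches state.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])
theta = np.zeros(2)
print(soft_tree_max_policy(P, r, theta, s=0, depth=2, gamma=0.5))
```

With `gamma=0` the score reduces to the immediate reward, so the sketch falls back to an ordinary softmax over `r[s]`; larger `depth` values fold more of the model's lookahead into each action's score.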

Videos

Policy Gradient with Tree Expansion · SlidesLive

Taxonomy

Topics: Reinforcement Learning in Robotics · Adversarial Robustness in Machine Learning · Machine Learning and Algorithms

Methods: Entropy Regularization · Proximal Policy Optimization · Softmax