SoftTreeMax: Policy Gradient with Tree Search

Gal Dalal; Assaf Hallak; Shie Mannor; Gal Chechik

arXiv:2209.13966·cs.LG·September 29, 2022

Open Access

TL;DR

SoftTreeMax integrates tree search with policy gradient methods, significantly reducing gradient variance and improving sample efficiency, leading to faster and more effective learning in control tasks like Atari games.

Contribution

The paper introduces SoftTreeMax, presented by the authors as the first approach to integrate tree search into policy gradient. It reduces gradient variance and improves sample efficiency relative to standard policy-gradient methods.

Findings

01. Reduces gradient variance by three orders of magnitude.

02. Achieves up to 5x faster performance than distributed PPO on Atari.

03. Demonstrates improved sample efficiency and faster convergence.

Abstract

Policy-gradient methods are widely used for learning control policies. They can be easily distributed to multiple workers and reach state-of-the-art results in many domains. Unfortunately, they exhibit large variance and subsequently suffer from high sample complexity since they aggregate gradients over entire trajectories. At the other extreme, planning methods, like tree search, optimize the policy using single-step transitions that consider future lookahead. These approaches have been mainly considered for value-based algorithms. Planning-based algorithms require a forward model and are computationally intensive at each step, but are more sample efficient. In this work, we introduce SoftTreeMax, the first approach that integrates tree-search into policy gradient. Traditionally, gradients are computed for single state-action pairs. Instead, our tree-based policy structure leverages…
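
The general idea of feeding a forward model's lookahead into a softmax policy can be illustrated with a toy sketch. This is not the paper's definition of SoftTreeMax, which the truncated abstract does not give. Everything below is an assumption made for illustration: the three-state deterministic model (NEXT_STATE, REWARD), the per-state leaf parameters theta, the choice to score each root action by its best depth-d path, and the temperature beta.

```python
import numpy as np

# Hypothetical deterministic forward model: 3 states, 2 actions.
NEXT_STATE = np.array([[1, 2], [2, 0], [0, 1]])            # NEXT_STATE[s, a]
REWARD = np.array([[0.0, 1.0], [0.5, 0.0], [1.0, 0.2]])    # REWARD[s, a]
NUM_ACTIONS = 2


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - np.max(x))
    return z / z.sum()


def best_from(state, depth, theta, gamma):
    """Best discounted return over all action sequences of length `depth`
    starting in `state`, with the learnable value theta[leaf] at the leaf."""
    if depth == 0:
        return theta[state]
    return max(
        REWARD[state, a] + gamma * best_from(NEXT_STATE[state, a], depth - 1, theta, gamma)
        for a in range(NUM_ACTIONS)
    )


def lookahead_scores(state, theta, depth, gamma):
    """One score per root action: immediate reward plus the discounted
    best continuation found by exhaustive expansion (assumed aggregation)."""
    return np.array([
        REWARD[state, a] + gamma * best_from(NEXT_STATE[state, a], depth - 1, theta, gamma)
        for a in range(NUM_ACTIONS)
    ])


def lookahead_softmax_policy(state, theta, depth, gamma=0.9, beta=1.0):
    """Softmax over lookahead scores instead of over per-action logits."""
    return softmax(beta * lookahead_scores(state, theta, depth, gamma))


theta = np.zeros(3)  # illustrative leaf parameters, one per state
probs = lookahead_softmax_policy(state=0, theta=theta, depth=2)
```

With depth=1 the scores reduce to the immediate reward plus the discounted leaf parameter of the next state. Larger depths let rewards further down the tree influence the action probabilities, at the cost of expanding NUM_ACTIONS**depth paths.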

Taxonomy

Topics: Machine Learning and Algorithms · Reinforcement Learning in Robotics · Parallel Computing and Optimization Techniques

Methods: Entropy Regularization · Proximal Policy Optimization