SoftTreeMax: Policy Gradient with Tree Search
Gal Dalal, Assaf Hallak, Shie Mannor, Gal Chechik

TL;DR
SoftTreeMax integrates tree search with policy gradient methods, significantly reducing gradient variance and improving sample efficiency, leading to faster and more effective learning in control tasks like Atari games.
Contribution
Introduces SoftTreeMax, the first approach to integrate tree search into policy gradient, reducing gradient variance and improving sample efficiency over standard policy-gradient methods.
Findings
Reduces gradient variance by three orders of magnitude.
Achieves up to 5x faster performance than distributed PPO on Atari.
Demonstrates improved sample efficiency and faster convergence.
Abstract
Policy-gradient methods are widely used for learning control policies. They can be easily distributed to multiple workers and reach state-of-the-art results in many domains. Unfortunately, they exhibit large variance and subsequently suffer from high sample complexity since they aggregate gradients over entire trajectories. At the other extreme, planning methods, like tree search, optimize the policy using single-step transitions that consider future lookahead. These approaches have been mainly considered for value-based algorithms. Planning-based algorithms require a forward model and are computationally intensive at each step, but are more sample efficient. In this work, we introduce SoftTreeMax, the first approach that integrates tree search into policy gradient. Traditionally, gradients are computed for single state-action pairs. Instead, our tree-based policy structure leverages…
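To make the general idea of a lookahead-based softmax policy concrete, here is a minimal sketch. It is not the paper's formulation, which is not given in the truncated abstract above. It assumes a hypothetical deterministic forward model (`step`) on a five-state line, exhaustive expansion to a fixed depth, and a log-sum-exp aggregate over the paths below each root action. The names `step`, `path_score`, `tree_softmax_policy`, the inverse temperature `beta`, and the leaf parameters `theta` are all illustrative choices.

```python
import math
from itertools import product

N_STATES = 5          # hypothetical toy environment: states 0..4 on a line
ACTIONS = (-1, +1)    # move left or move right


def step(state, action):
    """Hypothetical deterministic forward model; reward 1 when landing on the last state."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward


def path_score(state, action_seq, theta, gamma):
    """Discounted reward along one path plus a learned score theta[leaf] at the leaf."""
    total, discount = 0.0, 1.0
    for action in action_seq:
        state, reward = step(state, action)
        total += discount * reward
        discount *= gamma
    return total + discount * theta[state]


def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def tree_softmax_policy(state, theta, depth=3, gamma=0.99, beta=2.0):
    """Softmax over root actions, each scored by a soft aggregate over its depth-`depth` subtree."""
    logits = []
    for first_action in ACTIONS:
        scores = [
            beta * path_score(state, (first_action,) + rest, theta, gamma)
            for rest in product(ACTIONS, repeat=depth - 1)
        ]
        logits.append(logsumexp(scores))
    normalizer = logsumexp(logits)
    return [math.exp(logit - normalizer) for logit in logits]


theta = [0.0] * N_STATES                      # untrained leaf scores
probs = tree_softmax_policy(state=2, theta=theta)
```

With untrained leaf scores, the depth-3 policy at the middle state already prefers moving right, because only paths that start with a right move can reach the rewarding state within the lookahead. A depth-1 policy from the same state sees no reward and stays uniform. Exhaustive expansion grows exponentially with depth, so this sketch is only practical for tiny action sets and shallow trees.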
Taxonomy
Topics: Machine Learning and Algorithms · Reinforcement Learning in Robotics · Parallel Computing and Optimization Techniques
Methods: Entropy Regularization · Proximal Policy Optimization
