Don't throw away your value model! Generating more preferable text with Value-Guided Monte-Carlo Tree Search decoding
Jiacheng Liu, Andrew Cohen, Ramakanth Pasunuru, Yejin Choi, Hannaneh Hajishirzi, Asli Celikyilmaz

TL;DR
This paper introduces PPO-MCTS, a novel decoding algorithm that leverages the value network from PPO to enhance the quality of generated text, demonstrating significant improvements across multiple tasks.
Contribution
It presents a new value-guided decoding method that integrates the PPO value network with MCTS, improving text generation quality over standard PPO decoding.
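To make the idea of value-guided decoding concrete, here is a minimal sketch of a generic AlphaZero-style tree search in which a policy function supplies token priors and a value function scores partial sequences. It is an illustration under stated assumptions, not the authors' implementation: the names `Node`, `puct_select`, `mcts_next_token`, `toy_policy`, and `toy_value`, the PUCT formula variant, and all hyperparameters are hypothetical, and the paper's actual algorithm may differ in how it initializes, evaluates, and backs up values.

```python
import math


class Node:
    """One partial sequence in the search tree."""

    def __init__(self, prior):
        self.prior = prior      # policy probability of the token leading here
        self.visits = 0         # number of simulations that passed through
        self.value_sum = 0.0    # sum of backed-up value estimates
        self.children = {}      # token -> Node

    def q(self):
        # Mean backed-up value; 0.0 for unvisited nodes (an assumption).
        return self.value_sum / self.visits if self.visits else 0.0


def puct_select(node, c_puct):
    """Pick the child maximizing Q plus a prior-weighted exploration bonus."""
    total = sum(child.visits for child in node.children.values())

    def score(item):
        _, child = item
        bonus = c_puct * child.prior * math.sqrt(total + 1) / (1 + child.visits)
        return child.q() + bonus

    return max(node.children.items(), key=score)


def mcts_next_token(prefix, policy_fn, value_fn, num_sims=50, c_puct=1.0, max_len=8):
    """Run num_sims simulations and return the most-visited next token."""
    root = Node(prior=1.0)
    for _ in range(num_sims):
        node, seq, path = root, list(prefix), [root]
        # Select: walk down the tree until reaching a leaf.
        while node.children:
            token, node = puct_select(node, c_puct)
            seq.append(token)
            path.append(node)
        # Expand: add one child per token the policy proposes.
        if len(seq) < max_len:
            for token, prob in policy_fn(seq).items():
                node.children[token] = Node(prior=prob)
        # Evaluate: score the partial sequence with the value function.
        value = value_fn(seq)
        # Backup: propagate the value to every node on the path.
        for visited in path:
            visited.visits += 1
            visited.value_sum += value
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]


# Hypothetical stand-ins for a trained policy network and value network.
def toy_policy(seq):
    return {"a": 0.6, "b": 0.4}          # the policy alone prefers "a"


def toy_value(seq):
    return 1.0 if seq and seq[0] == "b" else 0.0   # value favors starting with "b"


greedy_token = max(toy_policy([]), key=toy_policy([]).get)
search_token = mcts_next_token([], toy_policy, toy_value)
```

In this toy setup, greedy decoding from the policy picks "a", while the search picks "b" because the value function consistently scores sequences beginning with "b" higher. Returning the most-visited child is one common choice; sampling in proportion to visit counts is another.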
Findings
PPO-MCTS outperforms standard PPO in text preferability.
The approach reduces mismatch between training and test scoring.
The method demonstrates effectiveness across four text generation tasks.
Abstract
Inference-time search algorithms such as Monte-Carlo Tree Search (MCTS) may seem unnecessary when generating natural language text based on state-of-the-art reinforcement learning such as Proximal Policy Optimization (PPO). In this paper, we demonstrate that it is possible to get extra mileage out of PPO by integrating MCTS on top. The key idea is not to throw out the value network, a byproduct of PPO training for evaluating partial output sequences, when decoding text out of the policy network. More concretely, we present a novel value-guided decoding algorithm called PPO-MCTS, which can integrate the value network from PPO to work closely with the policy network during inference-time generation. Compared to prior approaches based on MCTS for controlled text generation, the key strength of our approach is to reduce the fundamental mismatch of the scoring mechanisms of the partial…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Entropy Regularization · Proximal Policy Optimization · Monte-Carlo Tree Search
