A Critical Look At Tokenwise Reward-Guided Text Generation
Ahmad Rashid, Ruotian Wu, Julia Grosse, Agustinus Kristiadi, Pascal Poupart

TL;DR
This paper critically examines tokenwise reward-guided text generation, revealing limitations of current methods and proposing a new reward model trained on partial sequences that improves generation quality without large-scale fine-tuning.
Contribution
It introduces a novel Bradley-Terry reward model trained on partial sequences, analyzes its properties, and reports that the resulting method outperforms previous heuristic RGTG methods.
Findings
The new reward model is compatible with partial sequence scoring.
The proposed policy is proportional to the ratio of two RLHF policies.
The method performs similarly to strong offline baselines without large-scale fine-tuning.
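For orientation, the Bradley-Terry preference model behind these findings can be written as below. This is a sketch in common RLHF notation, not necessarily the paper's own: the symbols $r_\phi$ (reward model), $y_w$ and $y_l$ (preferred and rejected responses), $\sigma$ (logistic function), and the prefix index $i$ are assumptions introduced here for illustration.

```latex
% Standard Bradley-Terry model on full responses:
p(y_w \succ y_l \mid x) = \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)

% Assumed partial-sequence variant: both responses are truncated
% to their first i tokens before being scored.
p\big(y_w^{1:i} \succ y_l^{1:i} \mid x\big)
  = \sigma\big(r_\phi(x, y_w^{1:i}) - r_\phi(x, y_l^{1:i})\big)
```

Training on the second form gives the reward model explicit supervision on prefixes, which is what scoring partial sequences during decoding requires.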
Abstract
Large language models (LLMs) can be improved by aligning with human preferences through fine-tuning -- the so-called reinforcement learning from human feedback (RLHF). However, the cost of fine-tuning an LLM is prohibitive for many users. Due to their ability to bypass LLM fine-tuning, prediction-time tokenwise reward-guided text generation (RGTG) methods have recently been proposed. They use a reward model trained on full sequences to score partial sequences during decoding in a bid to steer the generation towards sequences with high rewards. However, these methods have so far been only heuristically motivated and poorly analyzed. In this work, we show that reward models trained on full sequences are not compatible with scoring partial sequences. To alleviate this, we propose to train a Bradley-Terry reward model on partial sequences explicitly, and autoregressively sample from the…
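The two ingredients named in the abstract, a Bradley-Terry loss on truncated responses and a reward-guided next-token distribution, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `reward_fn` callable, the `beta` weight, and the scoring rule `log pi_ref + beta * reward` (a form commonly used by RGTG methods) are all assumptions made for illustration.

```python
import math


def bt_partial_loss(reward_fn, prompt, winner, loser, prefix_len):
    """Bradley-Terry negative log-likelihood on truncated responses.

    reward_fn is a hypothetical callable (prompt, tokens) -> float.
    Both responses are cut to their first `prefix_len` tokens.
    """
    r_w = reward_fn(prompt, winner[:prefix_len])
    r_l = reward_fn(prompt, loser[:prefix_len])
    margin = r_w - r_l
    # -log(sigmoid(margin)), computed in a numerically stable way.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def guided_next_token_probs(ref_logprobs, partial_rewards, beta):
    """Next-token distribution proportional to pi_ref * exp(beta * reward).

    ref_logprobs[k]    : log-probability of candidate token k under the base LLM.
    partial_rewards[k] : assumed reward of the prefix extended by token k.
    """
    scores = [lp + beta * r for lp, r in zip(ref_logprobs, partial_rewards)]
    top = max(scores)  # subtract the max before exponentiating for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

With `beta = 0` the guided distribution reduces to the base model's distribution over the candidates; larger `beta` shifts probability mass toward tokens whose extended prefix receives a higher reward.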
Peer Reviews
Decision: Submitted to ICLR 2025
- This work gives a good recap of RLHF and DPO, so readers who are not familiar with the field can connect the dots.
- This work shows that a simple approach, dropping part of the continuation and training on the partial rewards, can improve inference-time generation quality (as judged by an LLM).
I think training a reward model on partial inputs is not novel, as quite a few works use partial observations to train reward models. For example, Learning to Rank Generation with Pairwise Partial Rewards (https://aclanthology.org/2023.emnlp-main.371.pdf), Teacher Forcing Recovers Reward Functions for Text Generation (https://openreview.net/pdf?id=1_gypPuWUC3), Reward-Augmented Decoding: Efficient Controlled Text Generation With a Unidirectional Reward Model (https://arxiv.org/p
- The paper proposes a straightforward method for training RMs on partial sequences.
- The presentation is clear and well-structured, covering classic RLHF concepts, DPO, and decode-time RGTG.
- The experiments evaluating performance and additional inference costs are thorough and balanced.
- Beyond theoretical analysis, the claim that full reward models produce arbitrary rewards would be stronger with empirical evidence, such as human experiments.
- For instance, an ablation study demonstrating the sensitivity of full RMs to varying sequence lengths could strengthen this argument.
- To validate that the token-level RMs (using partial sequences) are more effective than full sequence RMs, it would be beneficial to train PPO and compare it against full sequence RMs. One might expec
The theoretical analysis is great and the empirical evaluation is comprehensive. It is important to improve the text generation without expensive fine-tuning. The proposed method gave a practical approach with strong empirical experimental results.
The experiments focus only on automated metrics and GPT-4 evaluation. The paper would benefit from some human evaluation, even on a small scale, which would provide additional validation of the claimed improvements.
Taxonomy
Topics: Natural Language Processing Techniques
