Towards Cost-Effective Reward Guided Text Generation

Ahmad Rashid; Ruotian Wu; Rongqi Fan; Hongliang Li; Agustinus Kristiadi; Pascal Poupart

arXiv:2502.04517 · cs.LG · July 8, 2025

TL;DR

This paper introduces a reward model architecture for reward-guided text generation (RGTG) that scores all candidate tokens with a single call per generation step, making inference faster while remaining competitive with prior RGTG and RLHF methods.

Contribution

The paper proposes a new reward model, trained with a Bradley-Terry loss to prefer the optimal expansion of a partial sequence. One call per generation step scores every candidate token, which reduces inference overhead.
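As a minimal sketch of the Bradley-Terry objective named above, the loss for one preference pair is the negative log-probability that the preferred item outscores the rejected one. The function name and the scalar-score interface are illustrative assumptions, not the paper's code.

```python
import math


def bradley_terry_loss(score_preferred: float, score_rejected: float) -> float:
    """Negative log-likelihood that the preferred item wins under Bradley-Terry.

    P(preferred beats rejected) = sigmoid(score_preferred - score_rejected).
    """
    margin = score_preferred - score_rejected
    # -log(sigmoid(margin)) = log(1 + exp(-margin)), evaluated in a
    # numerically stable way for large positive or negative margins.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the two scores are equal the loss is log 2, and it shrinks toward zero as the preferred score pulls ahead.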

Findings

01. Faster inference than existing RGTG methods

02. Requires fewer calls to the reward model during generation

03. Performs competitively with previous RGTG and RLHF approaches

Abstract

Reward-guided text generation (RGTG) has emerged as a viable alternative to offline reinforcement learning from human feedback (RLHF). RGTG methods can align baseline language models to human preferences without the further training that standard RLHF methods require. However, they rely on a reward model to score each candidate token generated by the language model at inference, incurring significant test-time overhead. Additionally, the reward model is usually only trained to score full sequences, which can lead to sub-optimal choices for partial sequences. In this work, we present a novel reward model architecture that is trained, using a Bradley-Terry loss, to prefer the optimal expansion of a sequence with just a single call to the reward model at each step of the generation process. That is, a score for all possible candidate tokens is generated simultaneously, leading to efficient…
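One decoding step under this scheme might look like the sketch below. It assumes a hypothetical `reward_head(prefix)` that returns one score per vocabulary token in a single call. The top-k restriction, the additive combination, and the weight `beta` are illustrative choices, not details confirmed by the abstract.

```python
from typing import Callable, List, Sequence


def guided_step(
    prefix: Sequence[int],
    lm_logits: Sequence[float],
    reward_head: Callable[[Sequence[int]], Sequence[float]],
    top_k: int = 3,
    beta: float = 1.0,
) -> int:
    """Pick the next token id from the LM's top-k, re-ranked by reward."""
    # Single reward-model call: one score for every vocabulary token.
    reward_scores = reward_head(prefix)
    # Candidate set: the top_k token ids by language-model logit.
    candidates: List[int] = sorted(
        range(len(lm_logits)), key=lambda i: lm_logits[i], reverse=True
    )[:top_k]
    # Assumed combination rule: LM logit plus weighted reward score.
    return max(candidates, key=lambda i: lm_logits[i] + beta * reward_scores[i])
```

By contrast, a reward model that scores one sequence at a time would need a separate call for each of the `top_k` candidates at every step.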

Videos

Towards Cost-Effective Reward Guided Text Generation · SlidesLive

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: ALIGN