GenARM: Reward Guided Generation with Autoregressive Reward Model for Test-time Alignment

Yuancheng Xu; Udari Madhushani Sehwag; Alec Koppel; Sicheng Zhu; Bang An; Furong Huang; Sumitra Ganesh

arXiv:2410.08193·cs.CL·July 16, 2025

TL;DR

GenARM introduces an autoregressive reward model for test-time alignment of large language models. It enables efficient, flexible, multi-objective alignment without retraining the LLM, and matches the performance of training-time methods.

Contribution

The paper proposes GenARM, a new reward parametrization for autoregressive reward modeling, with theoretical guarantees and practical advantages over existing test-time alignment approaches.
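As a rough illustration of what an autoregressive reward parametrization can look like, the sketch below writes the reward of a full response as a sum of next-token terms, which matches Reviewer 03's description of "accumulated token-level rewards over a full response". The symbols $\pi_r$ (a learned next-token distribution), $x$ (prompt), and $y$ (response), as well as the log-probability form itself, are assumptions of this sketch and are not stated in the text above.

```latex
% Hypothetical notation: \pi_r is a learned next-token distribution,
% x is the prompt, y = (y_1, \dots, y_{|y|}) is the response.
r(x, y) \;=\; \sum_{t=1}^{|y|} \log \pi_r\!\left(y_t \mid x, y_{<t}\right)
```

Under this form, each summand can be read as a next-token reward that is available from a partial response, which is the property the abstract says trajectory-level RMs lack.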

Findings

01. GenARM outperforms prior test-time alignment methods.

02. It matches training-time alignment performance.

03. It supports multi-objective alignment and weak-to-strong guidance.

Abstract

Large Language Models (LLMs) exhibit impressive capabilities but require careful alignment with human preferences. Traditional training-time methods finetune LLMs using human preference datasets but incur significant training costs and require repeated training to handle diverse user preferences. Test-time alignment methods address this by using reward models (RMs) to guide frozen LLMs without retraining. However, existing test-time approaches rely on trajectory-level RMs which are designed to evaluate complete responses, making them unsuitable for autoregressive text generation that requires computing next-token rewards from partial responses. To address this, we introduce GenARM, a test-time alignment approach that leverages the Autoregressive Reward Model--a novel reward parametrization designed to predict next-token rewards for efficient and effective autoregressive generation.…
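The idea of guiding a frozen LLM with next-token rewards can be sketched as follows. This is a minimal illustration and not the authors' implementation: the combination rule (adding reward-model log-probabilities, scaled by `1 / beta`, to the base model's log-probabilities), the function name `guided_next_token_logprobs`, and the toy logits are all assumptions made for this example. Passing several reward models with separate `betas` shows how multi-objective guidance could work under the same assumption.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def guided_next_token_logprobs(base_logits, reward_logits_list, betas):
    """Combine a frozen base LM with one or more token-level reward models.

    Assumed rule (hypothetical, for illustration only):
        log p(token) = log p_base(token) + sum_i (1 / beta_i) * log p_reward_i(token) + const
    A smaller beta_i gives reward model i more influence.
    """
    combined = log_softmax(base_logits)
    for reward_logits, beta in zip(reward_logits_list, betas):
        reward_logprobs = log_softmax(reward_logits)
        combined = [c + r / beta for c, r in zip(combined, reward_logprobs)]
    # Renormalize so the result is a valid log-distribution over the vocabulary.
    return log_softmax(combined)


# Toy vocabulary of three tokens: the base model prefers token 0,
# while the reward model prefers token 2.
base_logits = [2.0, 1.0, 0.0]
reward_logits = [0.0, 0.0, 3.0]

guided = guided_next_token_logprobs(base_logits, [reward_logits], betas=[1.0])
probs = [math.exp(v) for v in guided]
```

Neither model's weights are updated here; only the next-token distribution used for sampling changes, which is what makes the approach a test-time one. In a real system the logits would come from forward passes of the base LLM and the reward model on the same partial response.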

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- GenARM's design is relatively simple and straightforward, which allows the model to learn the reward effectively.
- Compared with previous test-time alignment methods that naively used a trajectory-based RM to guide frozen LLMs to generate samples from the target distribution, GenARM provides more accurate token-level guidance and signals, at only a small cost to train the GenARM model.
- Empirical experiments show that GenARM achieves much better performance without trai

Weaknesses

- Compared with training-time methods, there is still a gap between the current method and DPO, even though the authors show theoretically that the method can approximate the target optimal distribution.
- The evaluation experiments are relatively simple and are conducted only on simple benchmarks.
- It would be good to compare with [1], since it is also closely related to test-time alignment.
- It would be good to add some discussion of previous token-level reward methods or literature, e.g. [2].

[1] Casca

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- The authors identify an important pain point of existing work, such as ARGS, that performs inference-time alignment on a token level without being trained for it.
- The theoretical analysis provides an important insight into the capacity of the parameterization the authors chose.
- The paper is written in a clear and easy-to-follow way.

Weaknesses

- While the focus on the downstream application of inference-time alignment is nice and leads to powerful results, I would expect work that proposes a new way to train a reward model to first evaluate how good the reward model itself is. Do the proposed reward models achieve better prediction capabilities than trajectory-level reward models, or the same level? I suggest using RewardBench [1] for such an evaluation, but even a classic train-test split on some preference dataset would be interesting.
-

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. The idea of addressing the mismatch between trajectory-level RMs and autoregressive text generation is promising and has been gaining attention in the community.
2. The proposed method is relatively versatile in experiments, supporting standard generation, weak-to-strong guidance, and multi-objective alignment.

Weaknesses

1. An Autoregressive Reward Model, trained by aligning the accumulated token-level rewards over a full response with the ground-truth preference ordering, has already been clearly proposed in the literature, e.g., [1,2] and the references therein. Further, the idea of per-step-reward guided generation has been presented in [3] and the references therein. The authors ought to provide adequate citation, discussion, and ideally comparison with these prior works. Otherwise, the contribution of this work will be s

Code & Models

Repositories

Yuancheng-Xu/GenARM
pytorch

Models

YuanchengXu/AutoregressiveRM-tulu2-7b
model · 4 downloads

Taxonomy

Topics: Software Testing and Debugging Techniques · Educational Technology and Assessment · Online Learning and Analytics