GenARM: Reward Guided Generation with Autoregressive Reward Model for Test-time Alignment
Yuancheng Xu, Udari Madhushani Sehwag, Alec Koppel, Sicheng Zhu, Bang An, Furong Huang, Sumitra Ganesh

TL;DR
GenARM introduces a novel autoregressive reward model for test-time alignment of large language models. It enables efficient, flexible, multi-objective alignment without retraining the base model, and it matches the performance of training-time alignment methods.
Contribution
The paper proposes GenARM, a new reward parametrization for autoregressive reward modeling, with theoretical guarantees and practical advantages over existing test-time alignment approaches.
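As a rough illustration of what an autoregressive reward parametrization can look like, the sketch below treats the response-level reward as the sum of per-token log-probabilities and trains it with a Bradley-Terry style preference loss. The function names, the `scale` parameter, and the choice of loss are assumptions made for this example, not details confirmed by the summary above.

```python
import math

def trajectory_reward(token_logprobs):
    """Accumulate per-token log-probabilities into one response-level reward.

    Assumption: the reward model emits a log-probability for each generated
    token, and the reward of a full response is their sum.
    """
    return sum(token_logprobs)

def preference_loss(chosen_logprobs, rejected_logprobs, scale=1.0):
    """Bradley-Terry style loss: -log(sigmoid(scale * (r_chosen - r_rejected))).

    `scale` is a hypothetical temperature-like hyperparameter.
    """
    margin = scale * (
        trajectory_reward(chosen_logprobs) - trajectory_reward(rejected_logprobs)
    )
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# The loss is small when the preferred response has the higher summed reward.
loss_when_ranked_correctly = preference_loss([-0.1, -0.2], [-1.0, -2.0])
loss_when_ranked_wrongly = preference_loss([-1.0, -2.0], [-0.1, -0.2])
```

Because the reward decomposes over tokens, the same model can score a partial response, which is the property that next-token guidance relies on.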
Findings
GenARM outperforms prior test-time alignment methods.
It matches training-time alignment performance.
It supports multi-objective alignment and weak-to-strong guidance.
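One way to picture multi-objective guidance without retraining is to mix the next-token log-probabilities of several reward models with user-chosen weights at decoding time. The sketch below assumes this additive form; `mix_next_token_logprobs`, the three-token vocabulary, and the "helpful" and "harmless" objectives are hypothetical and serve only as an illustration.

```python
import math

def mix_next_token_logprobs(base_logprobs, reward_logprobs_list, weights):
    """Add weighted reward-model log-probs to the frozen base model's log-probs.

    Returns a normalized log-distribution over the vocabulary.
    Assumption: each reward model outputs next-token log-probabilities over
    the same vocabulary as the base model.
    """
    scores = list(base_logprobs)
    for weight, reward_logprobs in zip(weights, reward_logprobs_list):
        scores = [s + weight * r for s, r in zip(scores, reward_logprobs)]
    # Normalize with a numerically stable log-sum-exp.
    top = max(scores)
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    return [s - log_z for s in scores]

# Toy distributions over a three-token vocabulary.
base = [math.log(p) for p in (0.5, 0.3, 0.2)]
helpful = [math.log(p) for p in (0.1, 0.8, 0.1)]
harmless = [math.log(p) for p in (0.1, 0.1, 0.8)]

favor_helpful = mix_next_token_logprobs(base, [helpful, harmless], [1.0, 0.0])
favor_harmless = mix_next_token_logprobs(base, [helpful, harmless], [0.0, 1.0])
```

Changing only the weights moves the most likely token from index 1 to index 2 in this toy setup, so the trade-off between objectives can be adjusted per request while the base model stays frozen.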
Abstract
Large Language Models (LLMs) exhibit impressive capabilities but require careful alignment with human preferences. Traditional training-time methods finetune LLMs using human preference datasets but incur significant training costs and require repeated training to handle diverse user preferences. Test-time alignment methods address this by using reward models (RMs) to guide frozen LLMs without retraining. However, existing test-time approaches rely on trajectory-level RMs which are designed to evaluate complete responses, making them unsuitable for autoregressive text generation that requires computing next-token rewards from partial responses. To address this, we introduce GenARM, a test-time alignment approach that leverages the Autoregressive Reward Model--a novel reward parametrization designed to predict next-token rewards for efficient and effective autoregressive generation.…
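The idea of guiding a frozen LLM with next-token rewards computed from partial responses can be sketched as a decoding loop. The example below is a minimal toy under stated assumptions: the two "models" are hand-written stand-ins, `beta` is a hypothetical guidance-strength parameter (smaller means stronger guidance), and the additive combination of log-probabilities is an assumed form rather than a confirmed detail of GenARM.

```python
import math

VOCAB = ["good", "bad", "<eos>"]

def toy_base_logprobs(prefix):
    """Hypothetical frozen LLM: mildly prefers "bad" first, then ends."""
    probs = (0.05, 0.05, 0.9) if prefix else (0.4, 0.5, 0.1)
    return [math.log(p) for p in probs]

def toy_arm_logprobs(prefix):
    """Hypothetical autoregressive reward model: rewards "good" at every step."""
    return [math.log(p) for p in (0.7, 0.1, 0.2)]

def guided_greedy_decode(base, arm, beta=1.0, max_len=5):
    """Greedy decoding on base log-probs plus (1 / beta) * reward log-probs."""
    prefix = []
    for _ in range(max_len):
        base_lp = base(prefix)
        arm_lp = arm(prefix)  # next-token reward from the partial response
        scores = [b + r / beta for b, r in zip(base_lp, arm_lp)]
        best = max(range(len(VOCAB)), key=scores.__getitem__)
        prefix.append(VOCAB[best])
        if VOCAB[best] == "<eos>":
            break
    return prefix

print(guided_greedy_decode(toy_base_logprobs, toy_arm_logprobs, beta=1.0))
# prints ['good', '<eos>']
print(guided_greedy_decode(toy_base_logprobs, toy_arm_logprobs, beta=100.0))
# prints ['bad', '<eos>']
```

Each step needs one forward pass of the base model and one of the reward model on the same prefix, so no complete candidate responses have to be generated and scored.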
Peer Reviews
Decision·ICLR 2025 Poster
- GenARM's design is relatively simple and straightforward, which allows the model to learn the reward effectively.
- Compared with previous test-time alignment methods that naively used trajectory-based RMs to guide frozen LLMs to generate samples from the target distribution, GenARM provides more accurate token-level guidance at only a small cost to train the GenARM model.
- Empirical experiments show that GenARM achieves much better performance without trai
- Compared with training-time methods, there is still a gap between the current method and DPO, even though the authors show theoretically that the method can approximate the target optimal distribution.
- The evaluation experiments are relatively simple and are conducted only on simple benchmarks.
- It would be good to compare with [1], since it is also closely related to test-time alignment.
- It would be good to add some discussion of previous token-level reward methods or literature, e.g. [2].
[1] Casca
- The authors identify an important pain point of existing work, such as ARGS, that performs inference-time alignment on a token level without being trained for it.
- The theoretical analysis provides an important insight into the capacity of the parameterization the authors chose.
- The paper is written in a clear and easy-to-follow way.
- While the focus on the downstream application of inference-time alignment is nice and leads to powerful results, I would expect work that proposes a new way to train a reward model to first evaluate how good the reward model itself is. Do the proposed reward models achieve better prediction capabilities than trajectory-level reward models, or the same level? I suggest using RewardBench [1] for such an evaluation, but even a classic train-test split on some preference dataset would be interesting.
-
1. The idea of addressing the mismatch between trajectory-level RMs and autoregressive text generation is promising and has been gaining attention in the community.
2. The proposed method is relatively versatile in experiments, supporting standard generation, weak-to-strong guidance, and multi-objective alignment.
1. The Autoregressive Reward Model, trained by aligning the accumulated token-level rewards over a full response with the ground-truth preference ordering, has already been clearly proposed in the literature, e.g., [1,2] and the references therein. Further, the idea of per-step-reward guided generation has been presented in [3] and the references therein. The authors ought to provide adequate citation, discussion, and ideally comparison with these prior works. Otherwise, the contribution of this work will be s
Taxonomy
Topics: Software Testing and Debugging Techniques · Educational Technology and Assessment · Online Learning and Analytics
