UniARM: Towards a Unified Autoregressive Reward Model for Multi-Objective Test-Time Alignment

Hongyan Xie; Yikun Ban; Ruiyu Fang; Zixuan Huang; Deqing Wang; Jianxin Li; Yitong Yao; Chao Wang; Shuangyong Song

arXiv:2602.09538·cs.CL·February 11, 2026

Open Access · 3 Reviews

TL;DR

UniARM introduces a unified autoregressive reward model that effectively aligns large language models with multiple human preferences at test time, reducing parameter complexity and improving preference trade-off control.

Contribution

The paper proposes UniARM, a novel framework that models multiple preferences jointly in a single parameter space, addressing limitations of previous independent or entangled approaches.

Findings

01

Unified modeling of preferences improves alignment accuracy.

02

Shared feature extraction reduces model complexity.

03

Enhanced control over preference trade-offs during inference.

Abstract

Multi-objective alignment aims to align LLM responses with multiple human preference objectives. Among existing methods, guiding the generation of frozen LLMs through autoregressive reward models (ARMs) to accomplish multi-objective test-time alignment is a low-cost solution. However, these methods typically rely on independent parameters for each preference objective, either by training ARMs independently across preference dimensions, which neglects interactions among preference features, or by training a single ARM with separate feature extraction modules for each preference, which can cause feature entanglement. Both strategies can result in misalignment between generated outputs and user preferences. To address this limitation, we propose Preference-Modulated & Shared Low-Rank Adaptation (MoSLoRA) for ARM training, which first extracts shared features via a preference-agnostic…
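To make the setting in the abstract concrete, here is a minimal sketch of reward-guided decoding with a frozen base model. It is not UniARM or MoSLoRA. It illustrates the generic baseline the abstract describes, in which separate token-level reward scores (one list per preference objective) are mixed by a user preference vector and added to the base model's logits. The function name `guided_next_token_probs`, the temperature-like parameter `beta`, and the toy three-token vocabulary are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def guided_next_token_probs(base_logits, reward_scores, preference, beta=1.0):
    """Combine frozen-LLM logits with preference-weighted token rewards.

    base_logits:   next-token logits from the frozen base model (one per token).
    reward_scores: one list per objective, each giving a token-level reward
                   for every candidate token (hypothetical ARM outputs).
    preference:    one non-negative weight per objective.
    beta:          assumed scaling factor; smaller beta means stronger guidance.
    """
    if len(reward_scores) != len(preference):
        raise ValueError("need exactly one preference weight per objective")
    if any(len(scores) != len(base_logits) for scores in reward_scores):
        raise ValueError("each objective must score every candidate token")

    combined = []
    for tok, base in enumerate(base_logits):
        # Weighted sum of the per-objective rewards for this candidate token.
        reward = sum(w * scores[tok] for w, scores in zip(preference, reward_scores))
        combined.append(base + reward / beta)
    return softmax(combined)


# Toy example: objective 0 favours token 0, objective 1 favours token 1.
helpful = [1.0, 0.0, 0.0]
harmless = [0.0, 1.0, 0.0]
probs = guided_next_token_probs([0.0, 0.0, 0.0], [helpful, harmless], [0.8, 0.2])
```

Changing the preference vector at inference time shifts probability mass between the tokens each objective favours, without updating the base model. The abstract's criticism is aimed at how the reward scores themselves are parameterized, which this sketch leaves abstract.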

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 8 · Confidence 4

Strengths

1. The paper is well-written and easy to understand.
2. The proposed method performs better while being more parameter-efficient than baselines. I believe this gain outweighs the slightly weak originality of the proposed method (weak because it consists of a minor change to the low-rank adapter used in PARM plus a regularizer).
3. I think the experiments are very well designed. Enough meaningful baselines are included. The ablation experiments provide clear clues as to how each component

Weaknesses

1. Although the proposed method is empirically effective, there is a lack of intuitive or theoretical explanation of where the effectiveness (i.e., the better Pareto front) comes from. Can the authors explicitly provide such explanations? For example, why can a different parameterization of the token-level reward models alone (according to the ablation experiment when $\lambda=0$) lead to a better Pareto front?
2. (Partially related to the first weakness) The generality of the method is unknown

Reviewer 02 · Rating 4 · Confidence 3

Strengths

- The related work section is comprehensive.
- The method eliminates the need to train multiple separate ARMs or preference-aware modules. Instead, it requires training only one preference-agnostic module and one preference-modulation module to align with multiple objectives.
- The experimental results show superior performance compared to previous test-time alignment methods.

Weaknesses

Overall, I find the methodological design of the paper reasonable, though the experimental section requires further strengthening. If the following concerns can be adequately addressed, I would consider raising my score.
- I'm unfamiliar with test-time multi-objective alignment methods based on ARMs, so I'm puzzled about the motivation behind this approach. Since we can fine-tune the ARM, why not just fine-tune the LLM itself?
- The backbone model used is relatively outdated. Employin

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. This paper addresses a very relevant area, using test-time alignment methods to achieve multi-objective goals, and achieves good experimental results.
2. Unlike the prior art PARM, this method does not require linearly combining different core tensors based on the preference vector at test time, which is more reasonable.
3. The experiments are extensive, including helpfulness and harmlessness evaluation, as well as weak-to-strong extension. It is especially good to see the weak-to-stro

Weaknesses

1. While the experimental setting follows prior work, the LLM used here does not seem very up to date. It would be nice if the authors could evaluate on more recent LLMs, for example tulu-3 instead of tulu-2.

Other minor issues
1. Typo in equation (10).
2. In Figure 2, the results of RS and MOD are set to zero. Although it is understandable that they are very expensive to run, I am not sure it is a good idea to set them to zero.

Taxonomy

Topics: Recommender Systems and Techniques · Machine Learning and Data Classification · Explainable Artificial Intelligence (XAI)