PARM: Multi-Objective Test-Time Alignment via Preference-Aware Autoregressive Reward Model
Baijiong Lin, Weisen Jiang, Yuancheng Xu, Hao Chen, Ying-Cong Chen

TL;DR
PARM introduces a unified, preference-aware autoregressive reward model that efficiently aligns large language models with diverse user preferences during inference, reducing costs and improving accuracy.
Contribution
It proposes PARM, a single, preference-aware ARM trained across all preference dimensions, addressing limitations of previous multi-ARM approaches.
Findings
PARM reduces inference costs compared to multiple ARMs.
PARM achieves better alignment with user preferences.
PARM enables weak-to-strong guidance for resource-limited settings.
Abstract
Multi-objective test-time alignment aims to adapt large language models (LLMs) to diverse multi-dimensional user preferences during inference while keeping LLMs frozen. Recently, GenARM (Xu et al., 2025) first independently trains Autoregressive Reward Models (ARMs) for each preference dimension without awareness of each other, then combines their outputs based on user-specific preference vectors during inference to achieve multi-objective test-time alignment, leading to two key limitations: the need for \textit{multiple} ARMs increases the inference cost, and the separate training of ARMs causes the misalignment between the guided generation and the user preferences. To address these issues, we propose Preference-aware ARM (PARM), a single unified ARM trained across all preference dimensions. PARM uses our proposed Preference-Aware Bilinear Low-Rank Adaptation (PBLoRA), which employs a…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Advanced Graph Neural Networks · Recommender Systems and Techniques
