Amulet: ReAlignment During Test Time for Personalized Preference Adaptation of LLMs
Zhaowei Zhang, Fengshuo Bai, Qizhi Chen, Chengdong Ma, Mingzhi Wang, Haoran Sun, Zilong Zheng, Yaodong Yang

TL;DR
Amulet is a training-free, real-time framework that personalizes large language model outputs through online test-time adaptation guided by user prompts, efficiently improving alignment with individual preferences.
Contribution
It introduces Amulet, a novel method for real-time, test-time personalization of LLMs without retraining, using online optimization with a closed-form solution for efficiency.
Findings
Significant performance improvements across various LLMs and datasets.
Maintains computational efficiency with negligible additional cost.
Effective personalization aligning with diverse user preferences.
Abstract
How to align large language models (LLMs) with user preferences from a static general dataset has been frequently studied. However, user preferences are usually personalized, changing, and diverse regarding culture, values, or time. This leads to the problem that the actual user preferences often do not coincide with those trained by the model developers in the practical use of LLMs. Since we cannot collect enough data and retrain for every demand, researching efficient real-time preference adaptation methods based on the backbone LLMs during test time is important. To this end, we introduce Amulet, a novel, training-free framework that formulates the decoding process of every token as a separate online learning problem with the guidance of simple user-provided prompts, thus enabling real-time optimization to satisfy users' personalized preferences. To reduce the computational cost…
Peer Reviews
Decision: ICLR 2025 Poster
S1: Test-time realignment for personalized preferences is an interesting research topic in the field of LLMs. S2: The proposed framework approaches the decoding of each token as an independent online learning problem and introduces a closed-form solution for optimization, which is novel to me. S3: The paper is well-written, and the experimental results seem good.
W1: Insufficient baselines. Only Pref and LA are included for comparison. More alignment approaches are needed. Additionally, the Pref baseline appears trivial in its implementation. Given that the studied preference dimensions in this work are easy to define, it can be effective to use more sophisticated prompt engineering approaches that could serve as stronger baselines, e.g., emphasizing the output format. W2: Lack of Evaluation on Implicit Preferences. While the paper demonstrates effectiv
Strengths: 1. The general idea is a hot topic in the field of LLMs (personalization alignment and test-time personalization alignment). 2. The idea of treating each step of token generation as an independent optimization problem to solve test-time alignment is an interesting exploration. 3. The paper is well-written, and the methods and formulas are easy to understand.
Weaknesses: 1. The experimental setup is not very clear. How did the authors tune the hyperparameters? See Questions below. 2. Based on my understanding, the method described in the paper ultimately results in a weighted combination of the base-prompt generation probability and the user-specific-prompt generation probability. Therefore, the authors should provide a baseline that only adjusts the fine-tuned $\alpha$, which essentially becomes Contrastive Decoding. 3. The authors emphasize the importance of real-tim
The paper is well-organized. Performing test-time alignment in LLMs for lightweight preference optimization seems reasonable to me. The proposed online-learning decoding process, which formulates each token generation as a separate online learning problem, seems novel. In addition, the authors provide a closed-form solution to reduce computational costs. Experiments with several datasets and backbone LLMs demonstrate the effectiveness of the proposed method.
Insufficient baselines. Only LA is used as the baseline model, likely due to the scarcity of tuning-free test-time alignment approaches. However, other baselines from related topics, such as those introduced in the related work, could be adapted to verify the effectiveness. One major merit of tuning-free test-time alignment is that it is lightweight. To support this, the time and computational complexity could be analyzed and the computational cost reported. Broken sentences, such as Line 209: 'Sinc
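One of the reviews above suggests a baseline that only weights the base-prompt and user-specific-prompt generation probabilities with a single coefficient $\alpha$, which the reviewer says essentially becomes Contrastive Decoding. The following is a minimal sketch of that kind of per-token mix, not Amulet's own update (the summary above describes it only as an online optimization with a closed-form solution). The function names, the log-space form of the mix, and the three-token toy logits are assumptions made for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def combine_next_token(base_logits, pref_logits, alpha):
    """Hypothetical contrastive-style mix of two next-token distributions.

    base_logits: logits from the model given the plain prompt (assumed input).
    pref_logits: logits from the model given the prompt plus the user's
                 preference instruction (assumed input).
    alpha:       how strongly to push away from the base distribution.

    Returns probabilities proportional to
        exp(log p_pref + alpha * (log p_pref - log p_base)).
    """
    base_lp = log_softmax(base_logits)
    pref_lp = log_softmax(pref_logits)
    mixed = [p + alpha * (p - b) for p, b in zip(pref_lp, base_lp)]
    # Renormalize so the result is a valid distribution.
    return [math.exp(x) for x in log_softmax(mixed)]


# Toy vocabulary of three tokens (made-up numbers, not real model outputs).
base_logits = [2.0, 1.0, 0.0]
pref_logits = [1.0, 2.0, 0.0]

probs_no_contrast = combine_next_token(base_logits, pref_logits, alpha=0.0)
probs_contrast = combine_next_token(base_logits, pref_logits, alpha=1.0)
print(probs_no_contrast)
print(probs_contrast)
```

With `alpha=0.0` the result is just the preference-prompted distribution; larger values of `alpha` shift more mass toward tokens that the preference prompt favors relative to the base prompt. In this sketch `alpha` is fixed across all decoding steps, which is what makes it a simple baseline.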
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
Methods: ALIGN
