TL;DR
This paper introduces Meta Reward Modeling, a meta-learning approach for personalized LLM alignment that efficiently adapts to individual user preferences with limited feedback, outperforming existing methods.
Contribution
It proposes a novel meta-learning framework for personalized reward models, enabling rapid adaptation and robustness across diverse users.
Findings
MRM improves few-shot personalization performance.
MRM enhances robustness to user variability.
Code is available at https://github.com/ModalityDance/MRM.
Abstract
Alignment of Large Language Models (LLMs) aims to align outputs with human preferences, and personalized alignment further adapts models to individual users. This relies on personalized reward models that capture user-specific preferences and automatically provide individualized feedback. However, developing these models faces two critical challenges: the scarcity of feedback from individual users and the need for efficient adaptation to unseen users. We argue that addressing these constraints requires a paradigm shift from fitting data to learn user preferences to learning the process of preference adaptation. To realize this, we propose Meta Reward Modeling (MRM), which reformulates personalized reward modeling as a meta-learning problem. Specifically, we represent each user's reward model as a weighted combination of base reward functions, and optimize the initialization of these…
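The idea of a per-user reward expressed as a weighted combination of base reward functions, adapted from a shared initialization, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper or its repository: the function names (`pairwise_loss_and_grad`, `adapt`, `meta_train`), the Bradley-Terry pairwise loss, the step sizes, and in particular the outer loop, which uses a Reptile-style first-order update as a simple stand-in because the abstract is truncated before it describes the actual meta-objective.

```python
import numpy as np

def pairwise_loss_and_grad(w, chosen, rejected):
    """Bradley-Terry loss for a linear mix of base rewards (assumed form).

    chosen, rejected: arrays of shape (n_pairs, k) holding the k base
    reward scores of the preferred and the dispreferred response.
    """
    diff = chosen - rejected                  # (n_pairs, k)
    margin = diff @ w                         # reward gap under weights w
    loss = np.mean(np.logaddexp(0.0, -margin))
    sig = 0.5 * (1.0 - np.tanh(margin / 2))   # stable sigmoid(-margin)
    grad = -(sig[:, None] * diff).mean(axis=0)
    return loss, grad

def adapt(w0, chosen, rejected, steps=5, lr=0.5):
    """Few-shot adaptation: a few gradient steps from the shared init w0."""
    w = w0.copy()
    for _ in range(steps):
        _, g = pairwise_loss_and_grad(w, chosen, rejected)
        w -= lr * g
    return w

def meta_train(users, k, meta_steps=200, meta_lr=0.1, seed=0):
    """Learn an initialization over users (Reptile-style, an assumption).

    users: list of (chosen, rejected) array pairs, one entry per user.
    """
    rng = np.random.default_rng(seed)
    w0 = np.zeros(k)
    for _ in range(meta_steps):
        chosen, rejected = users[rng.integers(len(users))]
        w_user = adapt(w0, chosen, rejected)
        w0 += meta_lr * (w_user - w0)         # move init toward adapted weights
    return w0
```

Under these assumptions, a new user would be handled by calling `adapt(w0, chosen, rejected)` on that user's handful of preference pairs, so only the k mixing weights change while the base reward functions stay fixed.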
