TL;DR
FSPO is a few-shot learning algorithm that personalizes large language models by quickly inferring a user-specific reward function from a few labeled preferences. It is trained on synthetic preference data.
Contribution
The paper introduces FSPO, a meta-learning approach to LLM personalization, together with user description rationalization (RAT) and a set of design choices for constructing synthetic preference data.
Findings
FSPO achieves an 87% win rate when personalizing responses for synthetic users.
FSPO attains a 70% win rate with real users in open-ended question answering.
Synthetic data with high diversity and coherence is crucial for transfer from synthetic training users to real users.
Abstract
Effective personalization of LLMs is critical for a broad range of user-interfacing applications such as virtual assistants and content curation. Inspired by the strong in-context capabilities of LLMs, we propose few-shot preference optimization (FSPO), an algorithm for LLM personalization that reframes reward modeling as a meta-learning problem. Under FSPO, an LLM learns to quickly infer a personalized reward function for a user via a few labeled preferences. FSPO also utilizes user description rationalization (RAT) to encourage better reward modeling and instruction following, recovering performance with the oracle user description. Since real-world preference data is challenging to collect at scale, we propose careful design choices to construct synthetic preference datasets for personalization, generating over 1M synthetic personalized preferences using publicly available LLMs. To…
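The meta-learning framing in the abstract can be illustrated with a minimal sketch: each user is treated as one task, a few of that user's labeled preferences become in-context examples, and a held-out preference is the target the model must get right. The `Preference` type, the prompt template, and the `make_task` helper below are hypothetical and chosen for illustration only. The paper's actual prompt format and training objective are not given in this summary.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Preference:
    """One labeled preference from a single user (hypothetical schema)."""
    prompt: str
    chosen: str
    rejected: str


def format_context(shots: List[Preference]) -> str:
    """Serialize few-shot preferences into a text context (assumed template)."""
    blocks = []
    for i, p in enumerate(shots, start=1):
        blocks.append(
            f"Example {i}\n"
            f"Question: {p.prompt}\n"
            f"Preferred: {p.chosen}\n"
            f"Dispreferred: {p.rejected}"
        )
    return "\n\n".join(blocks)


def make_task(
    user_prefs: List[Preference], k: int, rng: random.Random
) -> Tuple[str, Preference]:
    """Split one user's preferences into k in-context shots and one held-out query."""
    if len(user_prefs) <= k:
        raise ValueError("need more than k preferences to hold one out")
    shuffled = list(user_prefs)  # copy so the caller's list is not reordered
    rng.shuffle(shuffled)
    shots, query = shuffled[:k], shuffled[k]
    return format_context(shots), query


prefs = [
    Preference("Explain recursion.", "A short analogy.", "A formal proof."),
    Preference("Suggest a dinner.", "A vegetarian curry.", "A steak recipe."),
    Preference("Plan a weekend.", "A quiet hike.", "A crowded festival."),
]
context, query = make_task(prefs, k=2, rng=random.Random(0))
```

In this sketch, `context` would be prepended to `query.prompt`, and a model would be trained to prefer `query.chosen` over `query.rejected` given that context. Sampling a fresh split per user at each step is one way to obtain many tasks from a fixed preference dataset.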
