Personalized Language Modeling from Personalized Human Feedback

Xinyu Li; Ruiyang Zhou; Zachary C. Lipton; Liu Leqi

arXiv:2402.05133 · cs.CL · December 10, 2024 · 5 cites

Open Access · 1 Repo · 3 Reviews

TL;DR

This paper introduces Personalized-RLHF, a framework that enables large language models to generate personalized responses by learning individual user preferences through a lightweight user model, improving alignment with user-specific needs.

Contribution

The paper proposes a novel Personalized-RLHF framework that efficiently captures individual user preferences and enables scalable, personalized language model responses without requiring explicit preference articulation.

Findings

1. Personalized-LM responses better match individual user preferences.
2. P-RLHF outperforms vanilla RLHF and prompting-based methods.
3. The approach scales efficiently as the number of users grows.

Abstract

Personalized large language models (LLMs) are designed to tailor responses to individual user preferences. While Reinforcement Learning from Human Feedback (RLHF) is a commonly used framework for aligning LLMs with human preferences, vanilla RLHF assumes that all human preferences share the same distribution, preventing fine-tuned LLMs from generating personalized content when user preferences are diverse. In this work, we propose Personalized-RLHF (P-RLHF), an efficient framework that utilizes a lightweight user model to capture individual user preferences and jointly learns the user model and the personalized LLM from human feedback. P-RLHF exhibits the following three characteristics: (1) It enables an LLM to generate personalized content and scale efficiently with a growing number of users. (2) It handles both explicit user preferences described as textual input and implicit user…

Peer Reviews

Decision · Submitted to ICLR 2025

Reviewer 01 · Rating 6 · Confidence 3

Strengths

* The structure of this article is well-organized and clearly articulated, making it easy for readers to follow the flow of ideas and concepts presented throughout the text. Each section is logically arranged, allowing for a seamless understanding of the material.
* The section on RLHF is detailed, with a clear progression from general RLHF to the specifically designed personalized DPO. This work considers different user preferences from multiple angles and granularities. It clearly explains th

Weaknesses

* It seems that this article does not have released code. I'm not sure if I just couldn't find it, but without released code, reproducibility cannot be guaranteed.
* Like other RL works, the notation in this paper is too numerous and complex. Although the structure and presentation of the article are good, it somewhat hinders the readers' understanding. Perhaps a table to organize the notations could be helpful.
* There are concerns regarding the effectiveness of not training an additional rew

Reviewer 02 · Rating 3 · Confidence 4

Strengths

1. The issues related to DPO in personalized modeling are thoroughly discussed.
2. The presentation is easy to follow and well-structured.
3. In both synthetic and human evaluations, P-DPO significantly outperforms DPO.

Weaknesses

For a detailed discussion of the reviewers' concerns, please refer to the summary.

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. The paper presents the P-RLHF framework, which integrates both explicit and implicit user models to facilitate personalized learning from user feedback, thereby introducing a novel approach to the research on personalized LLMs. The introduction of P-DPO represents an innovative optimization method that effectively balances personalization and generalization for both known and unknown users. This approach demonstrates excellent scalability while addressing the personalized needs of multiple us

Weaknesses

1. Zero-Length Responses in Experiment One: The occurrence of zero-length responses, which the authors justify as an expected outcome via mathematical proofs, raises questions about whether an LLM should indeed produce responses with no content. This result appears to stem from a polarized experimental design and methodology, which may not reflect practical application scenarios or user expectations. A more nuanced experimental setup could better balance realistic use cases with the need for dis

Code & Models

Repositories

humainlab/personalized_rlhf
pytorch · Official

Taxonomy

Topics: Topic Modeling · Speech Recognition and Synthesis · Natural Language Processing Techniques

Methods: ALIGN