Everyone Deserves A Reward: Learning Customized Human Preferences

Pengyu Cheng; Jiawen Xie; Ke Bai; Yong Dai; Nan Du

arXiv:2309.03126 · cs.CL · September 18, 2023 · 2 cites

TL;DR

This paper introduces a method for learning customized reward models that capture diverse human preferences across different domains, improving personalized alignment of language models.

Contribution

It presents a new domain-specific preference dataset and a three-stage learning scheme for personalized reward models, enhancing data efficiency and preference preservation.

Findings

01. The three-stage scheme improves customization effectiveness.
02. General preference enrichment helps preserve broad alignment.
03. Customized preference imitation enhances personalization.
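Finding 03 refers to preference imitation. This page does not give the objective, so the following is only a minimal sketch of one common way to add an imitation term to reward-model training: a language-modeling loss on the preferred response, weighted and added to a ranking loss. The function names, the `imitation_weight` parameter, and the mean-over-tokens normalization are assumptions made for illustration, not details taken from the paper.

```python
import math


def imitation_nll(chosen_token_logprobs):
    """Mean negative log-likelihood of the preferred response's tokens.

    `chosen_token_logprobs` holds the model's log-probability for each token
    of the preferred response (hypothetical input for this sketch).
    """
    if not chosen_token_logprobs:
        raise ValueError("need at least one token log-probability")
    return -sum(chosen_token_logprobs) / len(chosen_token_logprobs)


def combined_objective(ranking_loss, imitation_loss, imitation_weight=1.0):
    """Ranking loss plus a weighted imitation term (weight is an assumption)."""
    return ranking_loss + imitation_weight * imitation_loss


# Example: two tokens, each predicted with probability 0.5.
nll = imitation_nll([math.log(0.5), math.log(0.5)])
total = combined_objective(ranking_loss=0.3, imitation_loss=nll, imitation_weight=0.5)
```

Setting `imitation_weight` to 0 recovers plain ranking-based training, which makes the contribution of the imitation term easy to ablate.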

Abstract

Reward models (RMs) are essential for aligning large language models (LLMs) with human preferences to improve interaction quality. However, the real world is pluralistic, which leads to diverse human preferences across religions, politics, cultures, etc. Moreover, each individual can have their own unique preferences on various topics. Neglecting this diversity, current human-feedback alignment methods consider only a general reward model, which is unsatisfactory for customized or personalized application scenarios. To explore customized preference learning, we collect a domain-specific preference (DSP) dataset, which includes preferred responses for each given query from four practical domains. In addition, from the perspective of data efficiency, we propose a three-stage customized RM learning scheme, then empirically verify its effectiveness on…
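The abstract does not state how a reward model is fit to preference pairs. As a hedged illustration, the sketch below shows the pairwise ranking loss widely used for reward models, `-log(sigmoid(r_chosen - r_rejected))`, in plain Python. Whether the paper uses exactly this form is an assumption here, and the function names are hypothetical.

```python
import math


def pairwise_ranking_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(margin)), computed in a numerically stable way.

    The loss is small when the preferred response scores higher than the
    rejected one, and grows roughly linearly when the order is reversed.
    """
    margin = reward_chosen - reward_rejected
    # softplus(-margin) = max(-margin, 0) + log(1 + exp(-|margin|))
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def mean_ranking_loss(reward_pairs):
    """Average the loss over (chosen, rejected) reward pairs."""
    if not reward_pairs:
        raise ValueError("need at least one preference pair")
    return sum(pairwise_ranking_loss(c, r) for c, r in reward_pairs) / len(reward_pairs)


# Example: the model ranks the first pair correctly and the second incorrectly.
loss = mean_ranking_loss([(2.0, 0.0), (0.0, 2.0)])
```

When the two rewards are equal the loss is `log(2)`, which is a convenient sanity check for an untrained reward head.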

Peer Reviews

Decision · Submitted to ICLR 2024

Reviewer 01 · Rating 5 · marginally below the acceptance threshold · Confidence 4

Strengths

- **Originality**: The paper proposes a novel benchmark dataset that can foster work into the direction of reward-model customization and for individual user preferences or specific use cases. They additionally propose a baseline approach to tackle this dataset, which is of limited novelty.
- **Quality**: The methodology presented in the paper, particularly the development of a new synthetic dataset and a multi-stage fine-tuning training process, demonstrates a high level of technical soundnes

Weaknesses

### Primary concerns:
- Limited Novelty of Baseline Approach: While the introduction of the benchmark dataset is innovative, the baseline approach proposed for addressing this dataset is somewhat derivative, primarily building upon the work by Askell et al. (2021). This is not in itself problematic if we consider the dataset to be the main contribution of the paper, but combined with the synthetic nature of the dataset it limits the significance of the contribution.
- Synthetic Nature of Datas

Reviewer 02 · Rating 3 · reject, not good enough · Confidence 3

Strengths

Interesting approach to generating context-dependent samples. Potentially interesting simple fine-tuning techniques.

Weaknesses

The paper could benefit from more detailed explanations of the data generation process and the used in the approach. It does not compare the approach with other state-of-the-art methods in the field. Overall, the clarity of the paper could be improved.

Reviewer 03 · Rating 5 · marginally below the acceptance threshold · Confidence 2

Strengths

- The generated DSP dataset can be useful to the community.
- The three-stage training scheme for customized RM learning looks legit to me.
- The discovery that imitation learning on customized preferences and general preference data enrichment preserves the RMs’ general preferring ability when fitting customized human preferences is interesting.

Weaknesses

It seems to me that the DSP dataset could inherently contain biases based on the chosen domains. How these biases are identified and mitigated is not clearly addressed, which is crucial in a study aiming to cater to diverse human preferences. The efficacy of the training scheme is tested on a specific dataset (DSP). The extent to which these findings can be generalized to other datasets or real-world scenarios is also not very clear.

Code & Models

Repositories

linear95/dsp · PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems