Privately Aligning Language Models with Reinforcement Learning

Fan Wu; Huseyin A. Inan; Arturs Backurs; Varun Chandrasekaran; Janardhan Kulkarni; Robert Sim

arXiv:2310.16960 · cs.LG · May 6, 2024 · 1 cites


TL;DR

This paper explores how to align large language models with user preferences using reinforcement learning while ensuring privacy through differential privacy techniques, balancing utility and privacy protections.

Contribution

It introduces a new differential privacy framework for RL-based alignment of language models and proves its correctness, with experimental validation.

Findings

01 Effective privacy-preserving alignment with competitive utility

02 Framework applicable to RL without human feedback and RLHF paradigms

03 Strong privacy guarantees demonstrated through experiments

Abstract

Positioned between pre-training and user deployment, aligning large language models (LLMs) through reinforcement learning (RL) has emerged as a prevailing strategy for training instruction-following models such as ChatGPT. In this work, we initiate the study of privacy-preserving alignment of LLMs through Differential Privacy (DP) in conjunction with RL. Following the influential work of Ziegler et al. (2020), we study two dominant paradigms: (i) alignment via RL without a human in the loop (e.g., positive review generation) and (ii) alignment via RL from human feedback (RLHF) (e.g., summarization in a human-preferred way). We give a new DP framework to achieve alignment via RL, and prove its correctness. Our experimental results validate the effectiveness of our approach, offering competitive utility while ensuring strong privacy protections.
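The reviews on this page note that the framework relies on DP-SGD, which one reviewer found too condensed in the paper. As a rough illustration only, the sketch below shows the generic DP-SGD update (clip each example's gradient, sum, add Gaussian noise, average). It is not the authors' implementation: the function names, the flat list-of-floats parameter layout, and all hyperparameter values are assumptions made for this example, and it does no privacy accounting.

```python
import math
import random


def clip_gradient(grad, clip_norm):
    """Rescale one example's gradient so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / max(norm, 1e-12))
    return [g * scale for g in grad]


def dp_sgd_step(params, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One generic DP-SGD update (hypothetical helper, not from the paper).

    params            -- list of floats
    per_example_grads -- one gradient (list of floats) per training example
    noise_multiplier  -- Gaussian noise std is noise_multiplier * clip_norm
    rng               -- a random.Random instance
    """
    # Clipping bounds any single example's influence on the summed gradient.
    clipped = [clip_gradient(g, clip_norm) for g in per_example_grads]
    summed = [sum(column) for column in zip(*clipped)]

    # Gaussian noise calibrated to the clipping bound is added to the sum.
    sigma = noise_multiplier * clip_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]

    # Average over the batch and take an ordinary gradient step.
    batch_size = len(per_example_grads)
    return [p - lr * n / batch_size for p, n in zip(params, noisy)]


if __name__ == "__main__":
    rng = random.Random(0)
    grads = [[3.0, 4.0], [0.1, -0.2]]
    print(dp_sgd_step([0.0, 0.0], grads, lr=0.1, clip_norm=1.0,
                      noise_multiplier=1.0, rng=rng))
```

In a real training run the privacy parameter ε reported in the paper's tables would come from a privacy accountant that tracks the noise multiplier, sampling rate, and number of steps; that part is omitted here.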

Peer Reviews

Decision · ICLR 2024 poster

Reviewer 01 · Rating 8 (accept, good paper) · Confidence 3

Strengths

(1) The approach proposed in this paper for aligning language models with PPO in a privacy-preserving way is original. (2) The paper is clearly written and well organized. (3) The paper gives a quite comprehensive analysis of the procedure and emphasizes the difficult issues in the implementation.

Weaknesses

(1) The DP part is too condensed to understand. The authors used DP-SGD on several occasions but without a clear explanation of this algorithm. In the main text, I could not find a concrete DP algorithm or a clear procedure for how it is combined with the alignment. (2) Algorithm 1 is not original. I don't see why it was presented in detail in the paper. The PRIVATE alignment should be more interesting.

Reviewer 02 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

The paper combines two well-understood algorithms, DP-SGD and PPO in a well-motivated task of reinforcement with human feedback and supervised fine-tuning. The experiments and the results are clear and the application of private model fine-tuning is reasonable. The paper is well organized and the writing is clear. The paper is well written and the algorithm is clearly explained. Their claims are based on the GPT-2 family of models and run experiments on the ROUGE metrics and the TweetEval bench

Weaknesses

While the examples are helpful, the overall motivation could be a bit stronger. What are we protecting and why? What is the threat model around incorporating human feedback? Are there examples of memorization from human feedback? The experiments in the main body do not include error bars or the number of trials. It is unclear in Table 1 why models with less privacy should do worse than those with more privacy (GPT-2 Medium, eps 4->8, or GPT-2 Large eps 8->Inf). Such results demand further study

Reviewer 03 · Rating 6 (marginally above the acceptance threshold) · Confidence 2

Strengths

1. The paper proposes a differentially private framework for aligning LLMs with RL, offering mathematical guarantees of privacy. 2. The paper empirically evaluates the framework on tasks like positive review generation and summarization, showing that it offers competitive utility while ensuring strong privacy protections.

Weaknesses

1. The paper employs DPSGD to ensure privacy in the alignment of large language models through reinforcement learning. The use of DPSGD is well established in the privacy literature. Furthermore, the paper does not introduce significant modifications to the RLHF process. The innovation seems to be more focused on engineering adjustments rather than novel theoretical contributions. 2. The paper discusses the trade-offs between privacy and utility but does not present these results in an i

Videos

Privately Aligning Language Models with Reinforcement Learning· slideslive

Taxonomy

Topics: Privacy-Preserving Technologies in Data