Privacy-Preserving Reinforcement Learning from Human Feedback via Decoupled Reward Modeling
Young Hyun Cho, Will Wei Sun

TL;DR
This paper introduces a differentially private framework for reinforcement learning from human feedback (RLHF) that applies privacy protection only to reward modeling. Theoretical analysis and empirical validation indicate improved private alignment of language models.
Contribution
It proposes a privacy-preserving RLHF framework that applies differential privacy only to reward learning and derives the final policy from the resulting private reward model, supported by theoretical bounds and empirical results.
Findings
Theoretically characterizes the privacy-induced suboptimality gap.
Establishes a minimax lower bound for private reward learning.
Empirically outperforms existing private baselines on language model alignment tasks.
Abstract
Preference-based fine-tuning has become an important component in training large language models, and the data used at this stage may contain sensitive user information. A central question is how to design a differentially private pipeline that is well suited to the distinct structure of reinforcement learning from human feedback. We propose a privacy-preserving framework that imposes differential privacy only on reward learning and derives the final policy from the resulting private reward model. Theoretically, we study the suboptimality gap and show that privacy contributes an additional additive term beyond the usual non-private statistical error. We also establish a minimax lower bound and show that the dominant term changes with sample size and privacy level, which in turn characterizes regimes in which the upper bound is rate-optimal up to logarithmic factors. Empirically,…
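The decoupled pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a linear Bradley-Terry reward model over fixed features, uses standard DP-SGD (per-example gradient clipping plus Gaussian noise) as a stand-in private learner, and omits privacy accounting entirely. The names `dp_reward_sgd`, `pick_best`, `clip`, and `noise_mult` are hypothetical and chosen for illustration only.

```python
import numpy as np

def dp_reward_sgd(feat_chosen, feat_rejected, *, clip=1.0, noise_mult=1.0,
                  lr=0.1, steps=200, batch=32, seed=0):
    """Fit a linear reward r(x) = w @ phi(x) on preference pairs with DP-SGD.

    Assumption: Bradley-Terry likelihood, P(chosen > rejected) = sigmoid(w @ z)
    with z = phi(chosen) - phi(rejected). Converting noise_mult into an
    (epsilon, delta) guarantee is NOT done here; it needs a privacy accountant.
    """
    rng = np.random.default_rng(seed)
    diff = feat_chosen - feat_rejected            # shape (n, d)
    n, d = diff.shape
    w = np.zeros(d)
    for _ in range(steps):
        idx = rng.choice(n, size=min(batch, n), replace=False)
        z = diff[idx]
        p = 1.0 / (1.0 + np.exp(-(z @ w)))        # model's P(chosen preferred)
        grads = -(1.0 - p)[:, None] * z           # per-example grad of -log p
        # Clip each example's gradient to L2 norm at most `clip`.
        norms = np.linalg.norm(grads, axis=1, keepdims=True)
        grads = grads / np.maximum(1.0, norms / clip)
        # Gaussian noise scaled to the clipping bound, added to the summed gradient.
        noise = rng.normal(0.0, noise_mult * clip, size=d)
        w -= lr * (grads.sum(axis=0) + noise) / len(idx)
    return w

def pick_best(w, candidate_feats):
    """Policy step: choose the candidate with the highest private reward.

    This only reads w, never the preference data, so by the post-processing
    property of differential privacy it adds no further privacy cost.
    """
    return int(np.argmax(candidate_feats @ w))
```

The design point mirrored here is that the sensitive preference pairs are touched only inside `dp_reward_sgd`; any downstream policy optimization that uses just the released reward parameters inherits the same privacy guarantee.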
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Mobile Crowdsensing and Crowdsourcing · Adversarial Robustness in Machine Learning
