Measuring memorization in RLHF for code completion
Aneesh Pappu, Billy Porter, Ilia Shumailov, Jamie Hayes

TL;DR
This paper investigates how different large language model alignment methods, especially RLHF, influence the memorization of training data, highlighting RLHF's potential to reduce data regurgitation and privacy risks.
Contribution
It provides a comparative analysis of memorization in RLHF and direct preference learning methods like IPO, revealing RLHF's advantage in mitigating data memorization.
Findings
RLHF reduces memorization compared to fine-tuning.
Data memorized during fine-tuning often remains memorized after RLHF.
Direct preference learning (e.g., IPO) increases the likelihood of data regurgitation relative to RLHF.
Abstract
Reinforcement learning with human feedback (RLHF) has become the dominant method to align large models to user preferences. Unlike fine-tuning, for which there are many studies regarding training data memorization, it is not clear how memorization is affected by or introduced in the RLHF alignment process. Understanding this relationship is important as real user data may be collected and used to align large models; if user data is memorized during RLHF and later regurgitated, this could raise privacy concerns. In addition to RLHF, other methods such as Direct Preference Optimization (DPO) and ΨPO have gained popularity for learning directly from human preferences, removing the need for optimizing intermediary reward models with reinforcement learning. In this work, we analyze how training data memorization can surface and propagate through each phase of RLHF and direct preference…
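To make "memorization" and "regurgitation" concrete, here is a minimal sketch of one common way to test for it in code completion: prompt the model with the prefix of a training example and check whether its completion is close to the true suffix. Everything here is an illustrative assumption and not necessarily the paper's exact protocol: the `complete` callable stands in for a model, the normalized edit distance is one possible similarity measure, and the `threshold=0.1` default is a hypothetical value.

```python
from typing import Callable, Iterable, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer length, so the result is in [0, 1]."""
    longest = max(len(a), len(b))
    return 0.0 if longest == 0 else edit_distance(a, b) / longest


def is_memorized(complete: Callable[[str], str], prefix: str,
                 target_suffix: str, threshold: float = 0.1) -> bool:
    """Flag an example when the completion nearly reproduces the true suffix.

    `complete` is a hypothetical stand-in for a model call; `threshold` is an
    assumed cutoff, not a value taken from the paper.
    """
    completion = complete(prefix)[:len(target_suffix)]
    return normalized_edit_distance(completion, target_suffix) <= threshold


def memorization_rate(complete: Callable[[str], str],
                      examples: Iterable[Tuple[str, str]],
                      threshold: float = 0.1) -> float:
    """Fraction of (prefix, suffix) training examples flagged as memorized."""
    examples = list(examples)
    if not examples:
        return 0.0
    hits = sum(is_memorized(complete, p, s, threshold) for p, s in examples)
    return hits / len(examples)
```

Under these assumptions, comparing `memorization_rate` on the same examples for a fine-tuned model, an RLHF-aligned model, and a model trained with direct preference learning would give the kind of side-by-side comparison the abstract describes.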
Peer Reviews
Decision·ICLR 2025 Poster
(1) The observations in this paper are interesting. (2) The observations can potentially help important real-world applications. (3) The paper is well written and easy to understand.
(1) It would be interesting to show that the observations in this paper are generally applicable across model backbones of various sizes. It seems that currently only Gemini Nano-1 (1.8B) is used as the base model in the experiments. (2) Related to the statement that IPO exhibits stronger memorization of preference data than RLHF, currently the result is obtained using one model backbone. It would be great to validate this statement using various model backbones. Besides, it seems that this stat
1. Thorough experimental methodology with appropriate controls and metrics 2. Important practical implications for deploying large language models 3. Novel and significant findings about RLHF's memorization properties 4. Validation across multiple scales, domains, and datasets 5. Clear writing and comprehensive presentation of results 6. Strong technical foundation and careful experimental design
1. Main experiments focus on one synthetic dataset, though results are validated on other datasets 2. Could explore a wider range of model architectures and scales 3. Some hyperparameter choices could be better justified 4. Analysis could include more direct preference learning methods beyond IPO 5. Could provide more detailed analysis of failure cases and limitations
1. The paper studies a very practical and critical problem: memorization in RLHF and its risk of data leakage. 2. The experiments are detailed.
I am new to RLHF for code completion models. I did not find any apparent weaknesses.
Taxonomy
Topics: Embedded Systems Design Techniques · Parallel Computing and Optimization Techniques · Real-time simulation and control systems
Methods: ALIGN · Focus
