PROPS: Progressively Private Self-alignment of Large Language Models
Noel Teku, Fengwei Tian, Payel Bhattacharjee, Souradip Chakraborty, Amrit Singh Bedi, Ravi Tandon

TL;DR
PROPS is a multi-stage framework for aligning large language models that protects the privacy of human preference labels while maintaining high utility, outperforming existing privacy-preserving methods on alignment tasks.
Contribution
The paper introduces PROPS, a novel multi-stage, preference-level privacy-preserving alignment framework, supported by theoretical guarantees and empirical validation across multiple models and datasets.
Findings
PROPS achieves up to 3x higher win-rates than DP-SGD.
PROPS achieves 2.5x higher win-rates than randomized response (RR)-based alignment.
The framework preserves preference-level privacy while improving utility over these baselines.
Abstract
Alignment is a key step in developing Large Language Models (LLMs) using human feedback to ensure adherence to human values and societal norms. Dependence on human feedback raises privacy concerns about how much a labeler's preferences may reveal about their personal values, beliefs, and personality traits. Existing approaches, such as Differentially Private SGD (DP-SGD), provide rigorous privacy guarantees by privatizing gradients during fine-tuning and alignment, but they can provide more privacy than necessary, since human preferences are tied only to the labels of (prompt, response) pairs, and they can degrade model utility. This work focuses on LLM alignment with preference-level privacy, which preserves the privacy of the preference labels provided by humans. We propose PROPS (PROgressively Private Self-alignment), a multi-stage privacy-preserving alignment framework where privately aligned models in…
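Preference-level privacy protects only the binary label a human attaches to a (prompt, response A, response B) triple, not the whole gradient as DP-SGD does. The classical way to privatize such a label is randomized response (RR), the mechanism the RR-based alignment baseline in the findings is named after. The following is a minimal sketch of standard epsilon-label-DP randomized response under stated assumptions: the 1/0 label encoding and all function names (`flip_probability`, `randomized_response`, `privatize_labels`, `debiased_rate`) are hypothetical and for illustration only. It is not the PROPS algorithm or the authors' code.

```python
import math
import random
from typing import List


def flip_probability(epsilon: float) -> float:
    """Probability that randomized response reports the opposite label."""
    return 1.0 / (1.0 + math.exp(epsilon))


def randomized_response(label: int, epsilon: float, rng: random.Random) -> int:
    """Release one binary preference label under epsilon-label-DP.

    Assumed encoding: 1 = labeler preferred response A, 0 = response B.
    The true label is kept with probability e^eps / (1 + e^eps), else flipped.
    """
    if label not in (0, 1):
        raise ValueError("label must be 0 or 1")
    if rng.random() < flip_probability(epsilon):
        return 1 - label
    return label


def privatize_labels(labels: List[int], epsilon: float, seed: int = 0) -> List[int]:
    """Apply randomized response independently to every preference label."""
    rng = random.Random(seed)
    return [randomized_response(y, epsilon, rng) for y in labels]


def debiased_rate(observed_rate: float, epsilon: float) -> float:
    """Unbiased estimate of the true fraction of 1-labels from the noisy fraction.

    E[observed] = p * (1 - q) + (1 - p) * q, where q is the flip probability,
    so p = (observed - q) / (1 - 2q). Requires epsilon > 0 so that q < 1/2.
    """
    q = flip_probability(epsilon)
    return (observed_rate - q) / (1.0 - 2.0 * q)


if __name__ == "__main__":
    true_labels = [1, 0, 1, 1, 0, 1, 0, 1]
    noisy = privatize_labels(true_labels, epsilon=1.0, seed=42)
    print("flip probability at eps=1:", round(flip_probability(1.0), 4))
    print("noisy labels:", noisy)
```

At epsilon = 1 each label is flipped with probability 1/(1 + e), roughly 0.269; smaller epsilon means more flips and stronger privacy. The `debiased_rate` helper shows the utility cost: the correction divides by (1 - 2q), so as epsilon shrinks toward 0 the noisy labels carry less and less recoverable signal, which is the trade-off that preference-level methods try to improve on.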
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Computational and Text Analysis Methods · Topic Modeling
