A Unified Theoretical Analysis of Private and Robust Offline Alignment: from RLHF to DPO
Xingyu Zhou, Yulian Wu, Francesco Orabona

TL;DR
This paper provides a unified theoretical framework analyzing how privacy and adversarial corruption affect offline alignment methods like RLHF and DPO, revealing key differences between privacy-first and corruption-first scenarios.
Contribution
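The paper introduces a reduction framework that, under linear modeling assumptions, reduces offline alignment to parameter estimation in logistic regression. It uses this framework to analyze the interplay between privacy and robustness, and to establish a separation between the LTC (local differential privacy, then corruption) and CTL (corruption, then local differential privacy) scenarios.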
Findings
LTC is a more challenging setting than CTL in offline alignment.
Under linear modeling assumptions, the reduction framework reduces offline alignment to parameter estimation in logistic regression.
The analysis advances the theoretical understanding of how privacy and robustness interact in offline alignment.
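The link to logistic regression can be illustrated with a minimal sketch. It assumes a Bradley-Terry style preference model with a linear reward, so that the probability of preferring one response over another is a sigmoid of the inner product between a parameter vector and the difference of the two responses' features. The function names, the synthetic feature distribution, and the plain gradient-ascent fit are illustrative assumptions, not the paper's estimator.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def simulate_preferences(theta, n, rng):
    """Draw n synthetic comparisons (hypothetical data generator).

    Each sample is (diff, label): diff stands in for the feature difference
    between two responses, and label is 1 with probability sigmoid(theta . diff).
    """
    data = []
    for _ in range(n):
        diff = [rng.uniform(-1.0, 1.0) for _ in theta]
        p = sigmoid(sum(t * d for t, d in zip(theta, diff)))
        label = 1 if rng.random() < p else 0
        data.append((diff, label))
    return data


def fit_logistic(data, dim, steps=200, lr=2.0):
    """Maximum-likelihood logistic regression via plain gradient ascent."""
    theta = [0.0] * dim
    n = len(data)
    for _ in range(steps):
        grad = [0.0] * dim
        for diff, label in data:
            residual = label - sigmoid(sum(t * d for t, d in zip(theta, diff)))
            for j in range(dim):
                grad[j] += residual * diff[j] / n
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return theta


if __name__ == "__main__":
    rng = random.Random(0)
    true_theta = [1.0, -0.5]  # assumed ground-truth reward parameter
    estimate = fit_logistic(simulate_preferences(true_theta, 4000, rng), dim=2)
    print("estimated theta:", estimate)
```

With clean labels, the estimate should land close to the assumed parameter. Privatized or corrupted labels would enter this picture as noise on `label`, which is where the privacy and robustness questions arise.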
Abstract
In this paper, we theoretically investigate the effects of noisy labels in offline alignment, with a focus on the interplay between privacy and robustness against adversarial corruption. Specifically, under linear modeling assumptions, we present a unified analysis covering both reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) under different privacy-corruption scenarios, such as Local differential privacy-then-Corruption (LTC), where human preference labels are privatized before being corrupted by an adversary, and Corruption-then-Local differential privacy (CTL), where labels are corrupted before privacy protection. Our analysis leverages a reduction framework that reduces the offline alignment problem under linear modeling assumptions to parameter estimation in logistic regression. This framework allows us to establish an interesting…
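The order of privatization and corruption described in the abstract can be made concrete with a toy simulation. The sketch below is not the paper's analysis: it swaps logistic regression for simple mean estimation of binary labels, uses randomized response as the local privacy mechanism, and models the adversary as overwriting a fixed fraction of labels with 1. All function names and parameter values are hypothetical.

```python
import math
import random


def keep_probability(epsilon):
    # Randomized response keeps the true bit with probability e^eps / (1 + e^eps).
    return math.exp(epsilon) / (1.0 + math.exp(epsilon))


def randomized_response(labels, epsilon, rng):
    """Privatize binary labels by flipping each one independently."""
    p = keep_probability(epsilon)
    return [y if rng.random() < p else 1 - y for y in labels]


def corrupt(labels, alpha):
    """Toy adversary (an assumption): overwrite the first alpha fraction with 1."""
    k = int(alpha * len(labels))
    return [1] * k + list(labels[k:])


def debiased_mean(private_labels, epsilon):
    """Invert the randomized-response bias to estimate the mean of the inputs."""
    p = keep_probability(epsilon)
    raw = sum(private_labels) / len(private_labels)
    return (raw - (1.0 - p)) / (2.0 * p - 1.0)


def run(n=100_000, epsilon=1.0, alpha=0.1, seed=0):
    rng = random.Random(seed)
    clean = [1 if rng.random() < 0.5 else 0 for _ in range(n)]
    true_mean = sum(clean) / n
    # CTL: corrupt first, then privatize.
    ctl = debiased_mean(randomized_response(corrupt(clean, alpha), epsilon, rng), epsilon)
    # LTC: privatize first, then corrupt the already-privatized labels.
    ltc = debiased_mean(corrupt(randomized_response(clean, epsilon, rng), alpha), epsilon)
    return abs(ctl - true_mean), abs(ltc - true_mean)


if __name__ == "__main__":
    ctl_err, ltc_err = run()
    print("CTL error:", ctl_err)
    print("LTC error:", ltc_err)
```

In this toy setting, corruption applied after privatization is amplified by the debiasing factor 1 / (2p - 1), where p is the keep probability, so the LTC error is larger than the CTL error. This is only meant to give intuition for why the two orderings can differ, not to reproduce the paper's bounds.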
Taxonomy
Topics: Auction Theory and Applications
Methods: Focus
