Right Now, Wrong Then: Non-Stationary Direct Preference Optimization under Preference Drift
Seongho Son, William Bankes, Sayak Ray Chowdhury, Brooks Paige, Ilija Bogunovic

TL;DR
This paper introduces NS-DPO, a novel method for fine-tuning large language models that accounts for changing user preferences over time, improving robustness and alignment in dynamic environments.
Contribution
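The paper proposes NS-DPO, a computationally efficient preference optimization method for non-stationary settings. It models time-dependent rewards with a Dynamic Bradley-Terry model and comes with a theoretical analysis of convergence, including upper bounds on estimation error and regret.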
Findings
NS-DPO outperforms baseline algorithms under preference drift.
NS-DPO maintains robustness in non-stationary preference scenarios.
The method does not sacrifice performance in stationary cases.
Abstract
Current Large Language Model (LLM) preference optimization algorithms do not account for temporal preference drift, which can lead to severe misalignment. To address this limitation, we propose Non-Stationary Direct Preference Optimisation (NS-DPO) that models time-dependent reward functions with a Dynamic Bradley-Terry model. NS-DPO proposes a computationally efficient solution by introducing only a single discount parameter in the loss function, which is used for exponential weighting that proportionally focuses learning on more time-relevant datapoints. We theoretically analyze the convergence of NS-DPO in a general setting where the exact nature of the preference drift is not known, providing upper bounds on the estimation error and regret caused by non-stationary preferences. Finally, we demonstrate the effectiveness of NS-DPO for fine-tuning LLMs under drifting preferences. Using…
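The exponential weighting described in the abstract can be illustrated with a minimal sketch. It assumes the weight for a datapoint observed at time t takes the form gamma^(T - t - 1), where T is the current time step, and that this weight multiplies the usual DPO log-sigmoid term. The function name `ns_dpo_loss`, its argument names, the default values, and the averaging over the batch are illustrative assumptions and are not taken from the paper's code.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def ns_dpo_loss(policy_logp_chosen, policy_logp_rejected,
                ref_logp_chosen, ref_logp_rejected,
                timestamps, current_time, beta=0.1, gamma=0.95):
    """Time-discounted DPO loss over a batch of preference pairs (sketch).

    All *_logp_* arguments are per-example sequence log-probabilities.
    timestamps[i] is the time step at which pair i was collected.
    The weight gamma ** (current_time - t - 1) is an assumed form of the
    exponential discount; gamma = 1.0 gives every pair the same weight.
    """
    total = 0.0
    for pw, pl, rw, rl, t in zip(policy_logp_chosen, policy_logp_rejected,
                                 ref_logp_chosen, ref_logp_rejected,
                                 timestamps):
        # Implicit reward margin between chosen and rejected responses.
        margin = beta * ((pw - rw) - (pl - rl))
        # Older datapoints (smaller t) receive exponentially smaller weight.
        weight = gamma ** (current_time - t - 1)
        total += -weight * log_sigmoid(margin)
    return total / len(timestamps)


if __name__ == "__main__":
    # Two pairs with identical log-probabilities, collected at t=0 and t=9.
    loss = ns_dpo_loss([-1.0, -1.0], [-2.0, -2.0],
                       [-1.5, -1.5], [-1.5, -1.5],
                       timestamps=[0, 9], current_time=10)
    print(loss)
```

With `gamma=1.0` every pair is weighted equally and the expression reduces to the standard DPO loss averaged over the batch, so the discount is the only extra hyperparameter in this sketch.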
Taxonomy
Topics: Economic and Environmental Valuation · Consumer Market Behavior and Pricing
