APPA: Adaptive Preference Pluralistic Alignment for Fair Federated RLHF of LLMs
Mahmoud Srewa, Tianyu Zhao, Salma Elmalaki

TL;DR
APPA introduces an adaptive method for federated RLHF that dynamically reweights group rewards, enhancing fairness and worst-group alignment without access to raw preference data.
Contribution
The paper proposes APPA, a novel adaptive reward reweighting framework for federated RLHF that improves fairness and worst-group alignment in large language models.
Findings
APPA improves worst-group alignment by up to 28% over average aggregation.
APPA maintains higher overall alignment than min aggregation in most cases.
APPA demonstrates effectiveness across multiple model families and benchmark datasets.
Abstract
Aligning large language models (LLMs) with diverse human preferences requires pluralistic alignment, where a single model must respect the values of multiple distinct groups simultaneously. In federated reinforcement learning from human feedback (FedRLHF), these groups align a shared policy without centralizing preference data, which makes fair reward aggregation essential. Existing aggregation methods exhibit clear trade-offs: average-based aggregation systematically under-aligns the worst-performing groups, while min aggregation prioritizes worst-group performance at the cost of overall alignment. We propose APPA, an Adaptive Preference Pluralistic Alignment framework that dynamically reweights group-level rewards based on historical alignment rewards. Our approach prioritizes under-aligned groups without degrading well-aligned ones, while requiring no access to raw preference data.…
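To make the contrast between the three aggregation styles concrete, here is a minimal Python sketch. The abstract does not state APPA's actual update rule, so the softmax-over-negated-historical-means weighting, the `temperature` parameter, and all function names below are illustrative assumptions, not the authors' method. The sketch only shows the general idea that groups with lower historical reward receive larger weight, and that only scalar rewards (never raw preference data) are shared.

```python
import math


def average_aggregate(rewards):
    """Average aggregation: every group counts equally."""
    return sum(rewards) / len(rewards)


def min_aggregate(rewards):
    """Min aggregation: only the worst-off group counts."""
    return min(rewards)


def adaptive_weights(history, temperature=1.0):
    """Hypothetical adaptive weights (an assumption, not the paper's rule).

    history: one list of past scalar rewards per group.
    Groups with a lower historical mean reward get a larger weight,
    via a softmax over the negated means.
    """
    means = [sum(h) / len(h) for h in history]
    logits = [-m / temperature for m in means]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_aggregate(rewards, history, temperature=1.0):
    """Weighted sum of current group rewards using the adaptive weights."""
    weights = adaptive_weights(history, temperature)
    return sum(w * r for w, r in zip(weights, rewards))


# Three hypothetical groups: well aligned, moderately aligned, under-aligned.
history = [[0.9, 0.8], [0.5, 0.6], [0.2, 0.3]]
rewards = [0.85, 0.55, 0.25]

weights = adaptive_weights(history)
combined = adaptive_aggregate(rewards, history)
```

In this toy setting the under-aligned third group receives the largest weight, and the combined reward falls between the min and the average, which mirrors the trade-off the abstract describes. Lowering `temperature` pushes the result toward min aggregation; raising it pushes toward the plain average.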
