A Systematic Evaluation of Preference Aggregation in Federated RLHF for Pluralistic Alignment of LLMs
Mahmoud Srewa, Tianyu Zhao, Salma Elmalaki

TL;DR
This paper systematically evaluates how different preference aggregation methods in federated RLHF impact the alignment and fairness of large language models, proposing an adaptive scheme that improves pluralistic alignment.
Contribution
It introduces a comprehensive evaluation framework and a novel adaptive aggregation scheme for federated RLHF, enhancing fairness and diversity in LLM alignment.
Findings
Adaptive aggregation improves fairness in LLM alignment.
The proposed method maintains competitive alignment scores.
Systematic evaluation framework aids in assessing pluralistic alignment.
Abstract
This paper addresses the challenge of aligning large language models (LLMs) with diverse human preferences within federated learning (FL) environments, where standard methods often fail to adequately represent diverse viewpoints. We introduce a comprehensive evaluation framework that systematically assesses the trade-off between alignment quality and fairness when using different aggregation strategies for human preferences. In our federated setting, each group locally evaluates rollouts and produces reward signals, and the server aggregates these group-level rewards without accessing any raw data. Specifically, we evaluate standard reward aggregation techniques (min, max, and average) and introduce a novel adaptive scheme that dynamically adjusts preference weights based on a group's historical alignment performance. Our experiments on question-answering (Q/A) tasks using a PPO-based…
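The aggregation strategies named in the abstract can be illustrated with a minimal sketch. The min, max, and average rules follow directly from the abstract. The adaptive rule does not: the abstract only says weights are adjusted based on a group's historical alignment performance, so the softmax-over-negated-history weighting below, the `temperature` parameter, and the function names `aggregate` and `adaptive_weights` are all assumptions made for illustration, not the authors' scheme.

```python
import math
from typing import List, Optional, Sequence


def aggregate(
    rewards: Sequence[float],
    method: str = "average",
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Combine one reward per group into a single server-side reward."""
    if not rewards:
        raise ValueError("rewards must be non-empty")
    if method == "min":
        # Optimize for the worst-off group.
        return min(rewards)
    if method == "max":
        # Optimize for the best-served group.
        return max(rewards)
    if method == "average":
        return sum(rewards) / len(rewards)
    if method == "weighted":
        if weights is None or len(weights) != len(rewards):
            raise ValueError("weights must match rewards in length")
        total = sum(weights)
        return sum(w * r for w, r in zip(weights, rewards)) / total
    raise ValueError(f"unknown method: {method}")


def adaptive_weights(
    history: Sequence[float], temperature: float = 1.0
) -> List[float]:
    """Hypothetical adaptive rule (an assumption, not the paper's formula).

    Groups with lower historical alignment scores receive larger weights,
    computed as a softmax over the negated scores.
    """
    logits = [-h / temperature for h in history]
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


# Usage: three groups report rewards for the same rollout.
group_rewards = [0.2, 0.8, 0.5]
past_alignment = [0.3, 0.9, 0.6]  # illustrative running scores per group
w = adaptive_weights(past_alignment)
combined = aggregate(group_rewards, "weighted", w)
```

Under this assumed rule, the group with the weakest past alignment pulls the combined reward toward its own signal, which sits between the average rule and the min rule in how strongly it favors underserved groups.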
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
The problem is well motivated: as LLMs are deployed widely, we will be gathering private data from users, so considering pluralistic data alongside private data in the federated learning setting is quite interesting.
1. My main concern is the readability of the paper. It is not clear what the user types are: does each client contain a different user, or are groups of clients assigned to the same user type? The variables l and N seem to be used interchangeably. It also took me a while to understand what the evaluation metrics mean. Significant work is needed to make the paper more readable. 2. Experiments with just a 2B model are too small, and at least 7B model experiments are the
1. This paper addresses a timely problem: fair preference aggregation in federated RLHF for pluralistic LLM alignment. 2. Proposes a systematic evaluation framework with comprehensive experiments across reward types and aggregation schemes. 3. The adaptive weighting strategy effectively boosts fairness without task demonstrations or demographic data.
1. The adaptive alpha aggregation is a heuristic extension of existing work, offering limited technical novelty. 2. Experiments are limited to multiple-choice QA with model-generated preferences; generalization to open-ended tasks or real human feedback remains unverified. 3. Evaluation relies solely on Gemma-2B-it; results may not generalize to larger or architecturally different LLMs, limiting the robustness of conclusions.
1. This paper measures the trained performance across numerous metrics, providing a convincing and comprehensive evaluation. 2. The paper considers quite a number of client reward methods and server aggregation approaches and shows their performance under different combinations. I can see that the authors put a great deal of effort into the experiments.
1. I find the paper weak in surveying related work. Some papers, such as FedBiscuit [1], should be discussed and compared in the experiments. 2. I am quite confused by this work. In Section 3, the authors state that they train the policy model $\pi_{\theta}^{policy}$ using PPO. As far as I know, PPO under RLHF requires both a reward model and a policy model. However, I cannot find the reward model. Instead, the work aggregates rewards but does not explain how they are obtained. Au
1. This paper concentrates on an important question in RLHF with federated learning: how to aggregate diverse preference signals, which is tied to fairness. 2. A systematic evaluation method is proposed. 3. A new adaptive aggregation strategy is proposed.
1. Several important modules need more clarification: - How is RLHF with federated learning conducted together with the evaluation system? How are the parameters updated? - Details about the evaluation set, the Pew Research Center’s Global Attitudes Surveys dataset. - Details about the preference prediction task and the preference ranking task. 2. More types of experiments are required to verify the practicality and generalizability of the whole evaluation framework, as the Pew Researc
Taxonomy
Topics: Topic Modeling · Advanced Graph Neural Networks · Natural Language Processing Techniques
