Towards Federated RLHF with Aggregated Client Preference for LLMs
Feijie Wu, Xiaoze Liu, Haoyu Wang, Xingchen Wang, Lu Su, Jing Gao

TL;DR
This paper introduces federated reinforcement learning with human feedback (RLHF) for large language models, enabling preference learning without sharing sensitive data, and demonstrates improved content quality through novel aggregation methods.
Contribution
It proposes FedBis and FedBiscuit, innovative federated RLHF techniques that encode and aggregate client preferences, addressing heterogeneity and privacy concerns in preference-based LLM fine-tuning.
Findings
Significant improvement in professionalism of generated content.
First federated RLHF benchmark with heterogeneous preferences.
Effective handling of preference heterogeneity and reward hacking.
Abstract
Reinforcement learning with human feedback (RLHF) fine-tunes a pretrained large language model (LLM) using user preference data, enabling it to generate content aligned with human preferences. However, due to privacy concerns, users may be reluctant to share sensitive preference data. To address this, we propose utilizing Federated Learning (FL) techniques, allowing large-scale preference collection from diverse real-world users without requiring them to transmit data to a central server. Our federated RLHF methods (i.e., FedBis and FedBiscuit) encode each client's preferences into binary selectors and aggregate them to capture common preferences. In particular, FedBiscuit overcomes key challenges, such as preference heterogeneity and reward hacking, through innovative solutions like grouping clients with similar preferences to reduce heterogeneity and using multiple binary selectors to…
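The aggregation idea in the abstract (each client trains a binary selector on its private preference pairs, and the server combines the selectors) can be illustrated with a toy sketch. This is not the authors' implementation: in the paper the selector is an LLM, whereas here a small logistic model over hand-made feature vectors stands in for it, the combination rule is plain FedAvg-style weighted averaging, and all function names, the synthetic data, and the hyperparameters are hypothetical. The client grouping and multiple selectors that the abstract attributes to FedBiscuit are not modeled.

```python
import math
import random


def predict(weights, pair):
    """Probability that the first completion in the pair is preferred."""
    feat_a, feat_b = pair
    score = sum(w * (a - b) for w, a, b in zip(weights, feat_a, feat_b))
    return 1.0 / (1.0 + math.exp(-score))


def local_update(weights, data, lr=0.5, epochs=5):
    """Run a few epochs of SGD on one client's private preference pairs."""
    w = list(weights)
    for _ in range(epochs):
        for pair, label in data:
            feat_a, feat_b = pair
            err = predict(w, pair) - label  # log-loss gradient w.r.t. the score
            for i in range(len(w)):
                w[i] -= lr * err * (feat_a[i] - feat_b[i])
    return w


def aggregate(client_weights, client_sizes):
    """Weighted average of client selectors (FedAvg-style, an assumption here)."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]


def make_client(rng, n_pairs, true_w):
    """Synthetic client data: label is 1 when the first completion is preferred."""
    data = []
    for _ in range(n_pairs):
        a = [rng.uniform(-1, 1) for _ in true_w]
        b = [rng.uniform(-1, 1) for _ in true_w]
        score = sum(w * (x - y) for w, x, y in zip(true_w, a, b))
        data.append(((a, b), 1 if score > 0 else 0))
    return data


def accuracy(weights, data):
    """Fraction of pairs where the selector picks the labeled preference."""
    correct = sum((predict(weights, pair) > 0.5) == bool(label) for pair, label in data)
    return correct / len(data)


def train(rounds=10, seed=0):
    """Simulate federated rounds; only selector weights leave each client."""
    rng = random.Random(seed)
    true_w = [1.0, -2.0, 0.5]  # hypothetical shared preference direction
    clients = [make_client(rng, n, true_w) for n in (20, 30, 40)]
    global_w = [0.0] * len(true_w)
    for _ in range(rounds):
        local = [local_update(global_w, data) for data in clients]
        global_w = aggregate(local, [len(data) for data in clients])
    return global_w, clients


if __name__ == "__main__":
    selector, clients = train()
    for idx, data in enumerate(clients):
        print(f"client {idx}: accuracy {accuracy(selector, data):.2f}")
```

In this sketch every client shares the same underlying preference, so simple averaging works well. With heterogeneous clients a single averaged selector would blur conflicting preferences, which is the situation the abstract says FedBiscuit targets by grouping similar clients.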
Peer Reviews
Decision: ICLR 2025 Poster
(1) This paper provides a novel federated learning-based reinforcement learning method to enhance the output ability of LLMs. To the best of my knowledge, this is the first work to employ federated learning techniques to collect preferences from diverse users for RLHF. (2) The organization of this paper is clear and easy to follow. In particular, Figure 2 clearly depicts the outline of the proposed FedBis model. (3) The algorithm design in Section 5.1 is practical and new to me. Besides, the key compone
(1) The experimental results in Section 7 are not extensive enough. This paper conducts experiments on two NLP tasks, summarization and question-answering; experiments on other tasks should be added, such as few-shot learning, synthetic tasks, code completion, and multi-needle retrieval. (2) I admire the proposed FL-based reinforcement learning method, but it would be more readable and more concise to summarize the contents of Section 5.1 Algorithm Design as pseudocode (i.e. shown in an A
- The paper deals with an important research problem. - The paper is generally well organised and presented, and easy to follow. - The authors proposed a sound solution to the research problem, showing advantages over baselines.
- The challenges identified on Page 2 seem universal and apply to many federated learning scenarios; they are not specific to this paper's research context. Therefore, although the overall problem setting seems to be new, the key research problem to tackle remains conventional. - Excessive computation overhead and preference heterogeneity seem to be common issues that have been addressed by various previous efforts. The clustering approach to addressing the key issue does not appear novel t
1. Generally, this paper is well-written with clear explanations of the methodologies and results. 2. The proposed FedBis and FedBiscuit demonstrate good performance on several benchmarks, suggesting good-quality research and implementation. 3. This paper addresses the heterogeneity issue, an important issue in all fields of federated learning research.
1. It would be better to include experiments that explore the sensitivity of the hyperparameters, particularly the number of clusters $U$, which is a crucial or even central parameter in the FedBiscuit method. However, the current results only present cases for $U=3$ and $U=5$, providing limited insights from these two configurations. 2. The authors should further explain why FedBiscuit with $U=3$ performs better in some cases while $U=5$ yields superior results in others. Since FedBiscuit addres
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Cryptography and Data Security · Cloud Data Security Solutions
Methods: ALIGN
