Responsible Federated LLMs via Safety Filtering and Constitutional AI
Eunchung Noh, Jeonghun Baek

TL;DR
This paper introduces safety filtering and constitutional AI techniques into federated learning for large language models, significantly enhancing their safety and trustworthiness.
Contribution
It is the first to integrate responsible AI methods into federated LLM training, addressing safety concerns in decentralized data environments.
Findings
Safety filtering and constitutional AI improve LLM safety by over 20% on AdvBench.
The methods effectively reduce unsafe and inappropriate responses in federated LLMs.
The approach enhances trustworthiness without compromising model performance.
Abstract
Recent research has increasingly focused on training large language models (LLMs) using federated learning, known as FedLLM. However, responsible AI (RAI), which aims to ensure safe and trustworthy responses, remains underexplored in this context. In FedLLM, client-side training data may contain harmful content, resulting in unsafe LLMs that can generate inappropriate responses. Aggregating such models into a global model and redistributing it to clients risks the widespread deployment of unsafe LLMs. To address this, we incorporate two well-established RAI techniques into FedLLM: safety filtering and constitutional AI. Our experiments show that these methods significantly improve LLM safety, achieving over 20% improvement on AdvBench.
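As a rough illustration of where safety filtering could sit in a federated round, the sketch below drops unsafe samples on each client before local training and then combines client weights with standard sample-weighted federated averaging. It is not the authors' implementation: the keyword `BLOCKLIST`, the function names `is_safe`, `filter_client_data`, and `fedavg`, and the toy weight format are all assumptions made for this example. A real system would likely replace the keyword check with a trained safety classifier. The constitutional AI step is not shown.

```python
from typing import Callable, Dict, List

# Toy weight format (assumption): parameter name -> flat list of floats.
Weights = Dict[str, List[float]]

# Hypothetical stand-in for a real safety classifier.
BLOCKLIST = ("build a bomb", "steal credit card")


def is_safe(sample: str) -> bool:
    """Return True if the sample contains no blocklisted phrase."""
    text = sample.lower()
    return not any(phrase in text for phrase in BLOCKLIST)


def filter_client_data(
    samples: List[str], safe: Callable[[str], bool] = is_safe
) -> List[str]:
    """Client-side safety filtering: keep only samples judged safe."""
    return [s for s in samples if safe(s)]


def fedavg(client_weights: List[Weights], client_sizes: List[int]) -> Weights:
    """Average client weights, weighting each client by its sample count."""
    total = sum(client_sizes)
    if total == 0:
        raise ValueError("no training samples left after filtering")
    reference = client_weights[0]
    return {
        name: [
            sum(w[name][i] * n for w, n in zip(client_weights, client_sizes)) / total
            for i in range(len(values))
        ]
        for name, values in reference.items()
    }
```

In this sketch, each client would call `filter_client_data` on its local data, train on what remains, and report the filtered sample count as its weight in `fedavg`, so that clients with mostly unsafe data contribute less to the global model.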
Taxonomy
Topics: European Criminal Justice and Data Protection · Privacy-Preserving Technologies in Data · Artificial Intelligence in Law
