Safely Learning with Private Data: A Federated Learning Framework for Large Language Model

JiaYing Zheng; HaiNan Zhang; LingXiang Wang; WangJie Qiu; HongWei Zheng; ZhiMing Zheng

arXiv:2406.14898·cs.CR·December 24, 2024

TL;DR

This paper introduces FL-GLM, a federated learning framework for large language models that enhances privacy and efficiency by preventing data leakage and enabling parallel training across distributed private data sources.

Contribution

The paper proposes an FL framework for LLMs that secures private data against attacks and improves training efficiency. It does so by placing the input and output blocks on the client, applying key encryption, and using optimized batching strategies.

Findings

1. FL-GLM achieves comparable performance to centralized models on NLP tasks.

2. The framework effectively prevents embedding gradient and peer-client reverse engineering attacks.

3. Experimental results show improved training efficiency with various acceleration methods.

Abstract

Private data, being larger in volume and higher in quality than public data, can greatly improve large language models (LLMs). However, due to privacy concerns, this data is often dispersed across multiple silos, which makes using it securely for LLM training a challenge. Federated learning (FL) is an ideal solution for training models with distributed private data, but traditional frameworks like FedAvg are unsuitable for LLMs because of their high computational demands on clients. An alternative, split learning, offloads most training parameters to the server while training the embedding and output layers locally, making it more suitable for LLMs. Nonetheless, it faces significant challenges in security and efficiency. Firstly, the gradients of embeddings are prone to attacks, leading to potential reverse engineering of private data. Furthermore, the server's limitation of handling only one client's training…
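The split-learning layout described in the abstract (embedding and output layers on the client, the remaining parameters on the server) can be illustrated with a minimal sketch. This is not the FL-GLM implementation: the names `Client`, `Server`, and `split_forward`, the ReLU layers, and the random weights are all hypothetical stand-ins, and the example covers only the forward pass, with no gradients, encryption, or batching.

```python
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    """Small random weights standing in for trained parameters (assumption)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class Client:
    """Hypothetical client: keeps token ids, embedding and output layers local."""

    def __init__(self, vocab, hidden, rng):
        self.embedding = rand_matrix(vocab, hidden, rng)  # one row per token id
        self.output = rand_matrix(vocab, hidden, rng)     # maps hidden -> vocab logits

    def embed(self, token_ids):
        return [list(self.embedding[t]) for t in token_ids]

    def logits(self, hidden_states):
        return [matvec(self.output, h) for h in hidden_states]


class Server:
    """Hypothetical server: holds the middle layers, sees only hidden states."""

    def __init__(self, hidden, n_layers, rng):
        self.layers = [rand_matrix(hidden, hidden, rng) for _ in range(n_layers)]

    def forward(self, hidden_states):
        for W in self.layers:
            # Toy layer: linear map followed by ReLU (not a transformer block).
            hidden_states = [[max(0.0, v) for v in matvec(W, h)] for h in hidden_states]
        return hidden_states


def split_forward(client, server, token_ids):
    sent = client.embed(token_ids)    # client -> server: hidden states only
    returned = server.forward(sent)   # server -> client: hidden states only
    return client.logits(returned)    # logits are computed on the client


rng = random.Random(0)
client = Client(vocab=5, hidden=4, rng=rng)
server = Server(hidden=4, n_layers=2, rng=rng)
out = split_forward(client, server, [1, 3, 2])
print(len(out), len(out[0]))  # one row of vocab-sized logits per input token
```

Keeping the raw token ids and the logits on the client is what the split buys. As the abstract notes, this alone does not make the scheme safe, because the embedding gradients exchanged during training can still be attacked.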

Code & Models

Repositories

TAP-LLM/SplitFedLLM
PyTorch · Official

Videos

Safely Learning with Private Data: A Federated Learning Framework for Large Language Model · Underline

Taxonomy

Topics: Privacy-Preserving Technologies in Data