Simple Yet Effective: Extracting Private Data Across Clients in Federated Fine-Tuning of Large Language Models

Yingqi Hu; Zhuo Zhang; Jingyuan Zhang; Jinghua Wang; Qifan Wang; Lizhen Qu; Zenglin Xu

arXiv:2506.06060·cs.CL·February 26, 2026

Open Access

TL;DR

This paper demonstrates that federated large language models can leak one client's private data to another participant through simple extraction methods, highlighting significant privacy risks and providing a benchmark for future privacy-preserving techniques.

Contribution

The paper introduces three simple yet effective data extraction strategies that reveal privacy leaks in FedLLMs, and it establishes a benchmark dataset and evaluation framework for privacy risk assessment.

Findings

01. Up to 56.6% of victim-exclusive PII can be recovered.

02. Names, addresses, and birthdays are highly vulnerable.

03. Proposes simple extraction methods that exploit memorization in FedLLMs.
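
The "victim-exclusive PII" figure in the first finding can be read as a recovery rate over PII that appears in the victim's data but not in the attacker's. The following is a minimal sketch of that reading, not the paper's evaluation code: the function name, the set-based exact matching, and the toy values are all assumptions made for illustration.

```python
def victim_exclusive_recovery_rate(victim_pii, attacker_pii, extracted):
    """Fraction of PII held only by the victim that shows up in the extracted set.

    Assumption: PII items are compared by exact string match.
    """
    # PII the attacker could not have learned from its own local data.
    victim_exclusive = set(victim_pii) - set(attacker_pii)
    if not victim_exclusive:
        return 0.0
    recovered = victim_exclusive & set(extracted)
    return len(recovered) / len(victim_exclusive)


# Toy, made-up values for illustration only.
victim = {"Alice Wang", "1990-01-02", "12 Elm Road", "Bob Li"}
attacker = {"Bob Li"}
extracted = {"Alice Wang", "Bob Li", "unrelated text"}

rate = victim_exclusive_recovery_rate(victim, attacker, extracted)
print(f"{rate:.3f}")
```

In this toy case three items are victim-exclusive and one of them is recovered, so the rate is one third. "Bob Li" does not count because the attacker already holds it locally.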

Abstract

Federated large language models (FedLLMs) enable cross-silo collaborative training among institutions while preserving data locality, making them appealing for privacy-sensitive domains such as law, finance, and healthcare. However, the memorization behavior of LLMs can lead to privacy risks that may cause cross-client data leakage. In this work, we study the threat of cross-client data extraction, where a semi-honest participant attempts to recover personally identifiable information (PII) memorized from other clients' data. We propose three simple yet effective extraction strategies that leverage contextual prefixes from the attacker's local data, including frequency-based prefix sampling and local fine-tuning to amplify memorization. To evaluate these attacks, we construct a Chinese legal-domain dataset with fine-grained PII annotations consistent with CPIS, GDPR, and CCPA standards,…
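
The abstract mentions frequency-based prefix sampling over contextual prefixes drawn from the attacker's local data. One plausible reading is sketched below: count the token contexts that precede annotated PII spans in local records, then sample prefixes in proportion to their frequency for later use as prompts. The record format, the `prefix_len` parameter, the proportional weighting, and both function names are assumptions; the paper's actual procedure may differ, and the model-querying step is omitted.

```python
from collections import Counter
import random


def prefix_counts(records, prefix_len=2):
    """Count token prefixes that immediately precede annotated PII spans.

    Each record is (tokens, pii_starts), where pii_starts lists the token
    indices at which a PII span begins (a hypothetical annotation format).
    """
    counts = Counter()
    for tokens, pii_starts in records:
        for start in pii_starts:
            if start >= prefix_len:
                counts[tuple(tokens[start - prefix_len:start])] += 1
    return counts


def sample_prefixes(counts, k, seed=0):
    """Sample k prefixes with probability proportional to their frequency."""
    rng = random.Random(seed)
    prefixes = list(counts)
    weights = [counts[p] for p in prefixes]
    return rng.choices(prefixes, weights=weights, k=k)


# Toy local records with made-up text and annotations.
records = [
    ("the defendant is named Alice".split(), [4]),
    ("the plaintiff is named Bob".split(), [4]),
    ("born on the date 1990".split(), [4]),
]

counts = prefix_counts(records)
sampled = sample_prefixes(counts, k=5)
print(counts.most_common(1))
```

Frequent contexts such as "is named" are sampled more often than rare ones, which matches the intuition that common phrasings around PII in the attacker's data are likely to also precede PII in other clients' data.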

Taxonomy

Topics: Privacy-Preserving Technologies in Data · Data Quality and Management · Topic Modeling