Clustering and Median Aggregation Improve Differentially Private Inference

Kareem Amin; Salman Avestimehr; Sara Babakniya; Alex Bie; Weiwei Kong; Natalia Ponomareva; Umar Syed

arXiv:2506.04566·cs.LG·June 6, 2025


Open Access

TL;DR

This paper improves differentially private language model inference in two ways: it clusters the sensitive inputs before forming inference batches, rather than sampling them uniformly at random, and it aggregates next-token statistics with privately computed medians instead of averages. The result is higher-quality synthetic text at a lower privacy cost.

Contribution

The paper proposes clustering-based batch selection in place of uniform sampling, together with private median aggregation of next-token statistics in place of averaging, to improve the quality of DP inference at a lower privacy cost.

Findings

01. Clustering input data improves the quality of DP-generated text.

02. Median aggregation reduces sensitivity and enhances privacy guarantees.

03. The method achieves higher representativeness and task performance with lower privacy costs.

Abstract

Differentially private (DP) language model inference is an approach for generating private synthetic text. A sensitive input example is used to prompt an off-the-shelf large language model (LLM) to produce a similar example. Multiple examples can be aggregated together to formally satisfy the DP guarantee. Prior work creates inference batches by sampling sensitive inputs uniformly at random. We show that uniform sampling degrades the quality of privately generated text, especially when the sensitive examples concern heterogeneous topics. We remedy this problem by clustering the input data before selecting inference batches. Next, we observe that clustering also leads to more similar next-token predictions across inferences. We use this insight to introduce a new algorithm that aggregates next-token statistics by privately computing medians instead of averages. This approach…
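
The two ideas in the abstract, forming batches within clusters and aggregating with a median rather than a mean, can be illustrated with a minimal, non-private sketch. Everything below is an assumption made for illustration and is not the authors' algorithm: the inputs are assumed to already be embedded as vectors, the clustering is a plain k-means with farthest-first initialization, the function names (`farthest_first_kmeans`, `clustered_batches`, `aggregate_logits`) are hypothetical, and the privacy mechanism (the noise or randomized selection that makes the median differentially private) is omitted entirely.

```python
import numpy as np

def farthest_first_kmeans(points, k, iters=20):
    """Plain Lloyd's k-means with deterministic farthest-first initialization.
    ASSUMPTION: stands in for whatever clustering the paper actually uses."""
    centers = [points[0]]
    for _ in range(1, k):
        # Distance from every point to its nearest already-chosen center.
        d = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[d.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return labels

def clustered_batches(labels, batch_size):
    """Form inference batches whose members all share one cluster label.
    Leftover examples (fewer than batch_size in a cluster) are dropped."""
    batches = []
    for j in np.unique(labels):
        idx = np.flatnonzero(labels == j)
        for start in range(0, len(idx) - batch_size + 1, batch_size):
            batches.append(idx[start:start + batch_size])
    return batches

def aggregate_logits(logits, how="median"):
    """Combine per-example next-token logits of shape (batch, vocab).
    NON-PRIVATE: a DP version would add a privacy mechanism on top."""
    if how == "median":
        return np.median(logits, axis=0)
    return logits.mean(axis=0)

# Toy data: two well-separated "topics" in a 2-D embedding space.
rng = np.random.default_rng(0)
embeddings = np.vstack([rng.normal(0.0, 0.5, (20, 2)),
                        rng.normal(10.0, 0.5, (20, 2))])
labels = farthest_first_kmeans(embeddings, k=2)
batches = clustered_batches(labels, batch_size=5)

# Toy logits: four examples agree on token 0, one outlier pushes token 2.
toy_logits = np.array([[2.0, 0.0, 0.0]] * 4 + [[-50.0, 0.0, 60.0]])
median_choice = aggregate_logits(toy_logits, "median").argmax()
mean_choice = aggregate_logits(toy_logits, "mean").argmax()
```

In the toy logits, a single outlying example is enough to change the token favored by the mean, while the coordinate-wise median still follows the majority. This is only meant to convey why more similar next-token predictions within a cluster pair naturally with median aggregation; the paper's actual private median computation is not reproduced here.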


Taxonomy

Topics: Privacy-Preserving Technologies in Data · Adversarial Robustness in Machine Learning · Topic Modeling