CanaryBench: Stress Testing Privacy Leakage in Cluster-Level Conversation Summaries
Deep Mehta

TL;DR
CanaryBench is a reproducible stress-testing framework that detects privacy leaks in cluster-level conversation summaries by injecting synthetic secret strings and measuring how often they appear in the published summaries.
Contribution
This work introduces CanaryBench, a method for quantifying privacy leakage in conversation summaries using synthetic canaries, and evaluates simple defenses against that leakage.
Findings
Canary leakage was observed in 50 of 52 canary-containing clusters.
A minimal defense combining a cluster-size threshold with redaction eliminated the measured leakage.
The approach provides a measurable way to assess privacy risks in published summaries.
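The defense named in the findings can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the threshold value, the `CANARY-XXXXXXXX` string format, the email pattern, and the function names `redact` and `publish_summaries` are all assumptions made for the example.

```python
import re

# Assumed formats for illustration only; the paper's actual patterns are not given here.
CANARY_PATTERN = re.compile(r"CANARY-[0-9A-F]{8}")
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact(text):
    """Replace anything matching the assumed sensitive patterns with a placeholder."""
    text = CANARY_PATTERN.sub("[REDACTED]", text)
    return EMAIL_PATTERN.sub("[REDACTED]", text)


def publish_summaries(clusters, summaries, min_cluster_size=5):
    """Return only the summaries that pass the size threshold, after redaction.

    clusters:  dict mapping cluster id -> list of member conversations
    summaries: dict mapping cluster id -> summary text
    min_cluster_size is a hypothetical default, not a value from the paper.
    """
    published = {}
    for cluster_id, members in clusters.items():
        # Small clusters are withheld entirely: a summary of a few
        # conversations is more likely to echo one of them verbatim.
        if len(members) < min_cluster_size:
            continue
        published[cluster_id] = redact(summaries[cluster_id])
    return published
```

The two steps are complementary in this sketch: the threshold suppresses summaries that describe very few conversations, while redaction catches identifier-shaped strings that survive in larger clusters.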
Abstract
Aggregate analytics over conversational data are increasingly used for safety monitoring, governance, and product analysis in large language model systems. A common practice is to embed conversations, cluster them, and publish short textual summaries describing each cluster. While raw conversations may never be exposed, these derived summaries can still pose privacy risks if they contain personally identifying information (PII) or uniquely traceable strings copied from individual conversations. We introduce CanaryBench, a simple and reproducible stress test for privacy leakage in cluster-level conversation summaries. CanaryBench generates synthetic conversations with planted secret strings ("canaries") that simulate sensitive identifiers. Because canaries are known a priori, any appearance of these strings in published summaries constitutes a measurable leak. Using TF-IDF embeddings…
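The measurement idea in the abstract can be sketched as below. This is a hedged illustration under stated assumptions, not the authors' code: the canary format, the carrier sentence, and the names `plant_canaries`, `naive_extractive_summary`, and `leak_report` are hypothetical, and the embedding and clustering step is omitted, so cluster assignments are passed in directly.

```python
import random


def make_canary(rng):
    """Generate a synthetic secret in an assumed CANARY-XXXXXXXX format."""
    return f"CANARY-{rng.getrandbits(32):08X}"


def plant_canaries(conversations, rate, rng):
    """Append a canary to a random fraction of conversations.

    Returns the modified conversations and a dict mapping
    conversation index -> planted canary string.
    """
    planted = {}
    modified = []
    for index, text in enumerate(conversations):
        if rng.random() < rate:
            canary = make_canary(rng)
            planted[index] = canary
            text = f"{text} My account code is {canary}."
        modified.append(text)
    return modified, planted


def naive_extractive_summary(texts):
    """A deliberately leaky summarizer that quotes the first member verbatim."""
    return "Representative message: " + texts[0]


def leak_report(assignments, planted, summaries):
    """Count canary-containing clusters whose summary exposes a member's canary.

    assignments: dict mapping conversation index -> cluster id
    Returns (leaking_clusters, canary_containing_clusters).
    """
    canaries_by_cluster = {}
    for index, canary in planted.items():
        canaries_by_cluster.setdefault(assignments[index], []).append(canary)
    leaking = [
        cluster_id
        for cluster_id, canaries in canaries_by_cluster.items()
        if any(canary in summaries.get(cluster_id, "") for canary in canaries)
    ]
    return len(leaking), len(canaries_by_cluster)
```

Because every planted string is known in advance, the check reduces to exact substring matching, and the resulting pair of counts has the same shape as the "leaking clusters out of canary-containing clusters" figure reported in the findings.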
Taxonomy
Topics: Spam and Phishing Detection · Mental Health via Writing · Privacy-Preserving Technologies in Data
