VoxPrivacy: A Benchmark for Evaluating Interactional Privacy of Speech Language Models
Yuxiang Wang, Hongyu Liu, Dekun Chen, Xueyao Zhang, Zhizheng Wu

TL;DR
VoxPrivacy introduces a new benchmark to evaluate how well Speech Language Models protect user privacy in multi-user environments, highlighting current vulnerabilities and proposing fine-tuning solutions.
Contribution
This paper presents VoxPrivacy, the first benchmark for interactional privacy in SLMs, along with a large dataset and fine-tuning methods to improve privacy-preserving responses.
Findings
Most open-source models perform near chance on privacy tasks.
Even strong closed-source models struggle with proactive privacy inference.
Fine-tuning on a large dataset enhances privacy capabilities without losing robustness.
Abstract
As Speech Language Models (SLMs) transition from personal devices to shared, multi-user environments such as smart homes, a new challenge emerges: the model is expected to distinguish between users to manage information flow appropriately. Without this capability, an SLM could reveal one user's confidential schedule to another, a privacy failure we term interactional privacy. Thus, the ability to generate speaker-aware responses becomes essential for safe SLM deployment. Current SLM benchmarks test dialogue ability but overlook speaker identity. Multi-speaker benchmarks check who said what without assessing whether SLMs adapt their responses. Privacy benchmarks focus on globally sensitive data (e.g., bank passwords) while neglecting contextual privacy-sensitive information (e.g., a user's private appointment). To address this gap, we introduce VoxPrivacy, the first benchmark designed to…
Peer Reviews
Decision: ICLR 2026 Poster
1. The paper addresses a novel, important, and underexplored problem in SLMs: interactional privacy in multi-user environments. It introduces the VoxPrivacy benchmark, based on a theoretically grounded definition of interactional privacy that uses Nissenbaum's Contextual Integrity. 2. The quality of the paper is good: the authors constructed a large-scale bilingual dataset with data synthesis, filtering, multi-model LLM generation, and human verification processes. The benchmark inclu
1. Synthetic dataset limitations (also acknowledged by the authors). The use of only synthetic, LLM-generated dialogues for privacy-sensitive utterances may reduce real-world relevance. The paper lacks user studies or comparisons with real data to confirm whether synthetic secrets match actual privacy concerns. 2. Artificial dialogue structure. The fixed 3-turn dialogue pattern (secret statement → privacy instruction → probe) may not fully capture the richness and variability of natural conversations
- The paper examines contextual privacy leakage in speech language models and engages with the unique capability of SLMs to process voice, which can uniquely identify a person. Hence, it makes sense to evaluate end-to-end privacy protection for SLMs. - The paper develops a benchmark covering both direct and indirect indicators of privacy information to perform a thorough evaluation of the privacy protection capabilities of closed-source and open-source models. - The evaluation reve
- I can't find a realistic grounding for the privacy violations in the benchmark. The benchmark assembles the specification, instruction, and probing queries into a multi-turn dialogue, which corresponds to the situation when multiple users converse with the SLM in the same session. In these cases, people already have equal access to the output of the model, which means the sensitivity of information in the output should be determined by everyone present in the conversation, rather than just the
- High novelty: The addressed research problem is novel, and this is, as far as I know, the first dataset and methodology for evaluating interactional privacy. - High quality: The proposed dataset is designed following principles of good design, the validation tests for the dataset are good, and the analysis of results is insightful. - Good clarity: Writing and argumentation are clear, with only minor blemishes. - High significance: As this work addresses an important problem that has not been
Main weaknesses: - Argumentation: Building a dataset for SLMs was motivated by the fact that spoken dialogues carry plenty of contextual information that is not available in text alone. This is true; speech is a much more informative representation than text, and my informed guess is that much of the information related to interactional privacy is available only in the voice (not in text). That said, as the data here is created through synthesis from text, there is no way to confirm that informati
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Speech Recognition and Synthesis · Mobile Crowdsensing and Crowdsourcing
