User Perceptions vs. Proxy LLM Judges: Privacy and Helpfulness in LLM Responses to Privacy-Sensitive Scenarios

Xiaoyuan Wu; Roshni Kaushik; Wenkai Li; Lujo Bauer; Koichi Onoue

arXiv:2510.20721·cs.CL·January 16, 2026

Open Access

TL;DR

This study reveals that proxy LLM judges do not accurately reflect user perceptions of privacy and helpfulness in sensitive scenarios, highlighting the need for user-centered evaluation methods.

Contribution

The paper demonstrates that proxy LLM judgments correlate poorly with user perceptions, emphasizing the importance of direct user studies for evaluating privacy and helpfulness in LLM responses.

Findings

01. Users show low agreement with one another when evaluating responses.

02. Proxy LLMs agree strongly with one another but correlate poorly with users.

03. Evaluation of LLM privacy and helpfulness needs to be user-centered.

Abstract

Large language models (LLMs) are rapidly being adopted for tasks like drafting emails, summarizing meetings, and answering health questions. In these settings, users may need to share private information (e.g., contact details, health records). To evaluate LLMs' ability to identify and redact such information, prior work introduced real-life, scenario-based benchmarks (e.g., ConfAIde, PrivacyLens) and found that LLMs can leak private information in complex scenarios. However, these evaluations relied on proxy LLMs to judge the helpfulness and privacy-preservation quality of LLM responses, rather than directly measuring users' perceptions. To understand how users perceive the helpfulness and privacy-preservation quality of LLM responses to privacy-sensitive scenarios, we conducted a user study ($n = 94$) using 90 PrivacyLens scenarios. We found that users had low agreement with each other…

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Topic Modeling · Text Readability and Simplification