MedPriv-Bench: Benchmarking the Privacy-Utility Trade-off of Large Language Models in Medical Open-End Question Answering

Shaowei Guan; Yu Zhai; Hin Chi Kwok; Jiawei Du; Xinyu Feng; Jing Li; Harry Qin; Vivian Hui

arXiv:2603.14265·cs.CL·March 17, 2026

Open Access

TL;DR

MedPriv-Bench introduces a novel benchmark for evaluating the balance between privacy preservation and clinical utility in large language models used for medical question answering, addressing a critical gap in healthcare AI safety.

Contribution

This work presents the first benchmark specifically designed to assess privacy risks alongside utility in medical LLMs, using a multi-agent pipeline and automated privacy leakage evaluation.

Findings

01 Identifies a pervasive privacy-utility trade-off in medical LLMs.

02 Automated evaluation reaches 85.9% alignment with human expert judgments.

03 Highlights the need for domain-specific benchmarks in healthcare AI.

Abstract

Recent advances in Retrieval-Augmented Generation (RAG) have enabled large language models (LLMs) to ground outputs in clinical evidence. However, connecting LLMs with external databases introduces the risk of contextual leakage: a subtle privacy threat where unique combinations of medical details enable patient re-identification even without explicit identifiers. Current healthcare benchmarks focus heavily on accuracy and ignore such privacy issues, despite strict regulations such as the Health Insurance Portability and Accountability Act (HIPAA) and the General Data Protection Regulation (GDPR). To fill this gap, we present MedPriv-Bench, the first benchmark specifically designed to jointly evaluate privacy preservation and clinical utility in medical open-ended question answering. Our framework utilizes a multi-agent, human-in-the-loop pipeline to synthesize sensitive medical contexts and…
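To make the notion of contextual leakage concrete, the sketch below counts how many records share each combination of details in a small, made-up dataset. It is not the MedPriv-Bench pipeline or its leakage metric; the records, the field names (`age_band`, `region`, `diagnosis`), and the helper functions are hypothetical and chosen only for illustration. The point is that no single field isolates a patient, yet the combination can.

```python
from collections import Counter

# Hypothetical toy records: no names or IDs, only clinical/demographic details.
RECORDS = [
    {"age_band": "30-39", "region": "North", "diagnosis": "asthma"},
    {"age_band": "30-39", "region": "North", "diagnosis": "asthma"},
    {"age_band": "30-39", "region": "North", "diagnosis": "Fabry disease"},
    {"age_band": "60-69", "region": "South", "diagnosis": "type 2 diabetes"},
    {"age_band": "60-69", "region": "South", "diagnosis": "type 2 diabetes"},
]


def group_sizes(records, fields):
    """For each record, return how many records share its values on `fields`."""
    keys = [tuple(record[f] for f in fields) for record in records]
    counts = Counter(keys)
    return [counts[key] for key in keys]


def unique_records(records, fields):
    """Indices of records whose combination of `fields` appears exactly once."""
    sizes = group_sizes(records, fields)
    return [i for i, size in enumerate(sizes) if size == 1]


if __name__ == "__main__":
    # Age band and region alone leave every record in a group of two or more.
    print(unique_records(RECORDS, ["age_band", "region"]))
    # Adding the diagnosis singles out one record, with no explicit identifier.
    print(unique_records(RECORDS, ["age_band", "region", "diagnosis"]))
```

In this toy data, a response that repeats the age band, region, and rare diagnosis together would point to a single record, which is the kind of risk the abstract describes.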

Taxonomy

Topics: Machine Learning in Healthcare · Artificial Intelligence in Healthcare and Education · Topic Modeling