Does BERT Pretrained on Clinical Notes Reveal Sensitive Data?

Eric Lehman; Sarthak Jain; Karl Pichotta; Yoav Goldberg; Byron C. Wallace

arXiv:2104.07762·cs.CL·April 26, 2021

TL;DR

This study investigates whether a BERT model pretrained on clinical notes can leak sensitive patient information. Simple probing methods fail to extract such information in any meaningful way, but more advanced attacks may still pose a risk.

Contribution

The paper introduces probing techniques to assess privacy risks in BERT models trained on clinical data and provides a baseline for future research on data exposure risks.

Findings

01. Simple probing methods do not meaningfully extract sensitive information from a BERT model pretrained on EHR notes.

02. More advanced attack methods may succeed in extracting personal health data.

03. The experimental setup and baseline models are publicly available for further research.

Abstract

Large Transformers pretrained over clinical notes from Electronic Health Records (EHR) have afforded substantial gains in performance on predictive clinical tasks. The cost of training such models (and the necessity of data access to do so) coupled with their utility motivates parameter sharing, i.e., the release of pretrained models such as ClinicalBERT. While most efforts have used deidentified EHR, many researchers have access to large sets of sensitive, non-deidentified EHR with which they might train a BERT model (or similar). Would it be safe to release the weights of such a model if they did? In this work, we design a battery of approaches intended to recover Personal Health Information (PHI) from a trained BERT. Specifically, we attempt to recover patient names and conditions with which they are associated. We find that simple probing methods are not able to meaningfully extract…
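
The name-to-condition probing idea described above can be illustrated with a minimal, self-contained sketch. Everything in it is an assumption made for illustration rather than the paper's actual setup: the prompt template, the toy patient records (synthetic names), the made-up condition frequencies, and the `Scorer` interface, which stands in for whatever score a masked language model would assign to a candidate condition at the `[MASK]` position. The sketch also includes a name-independent frequency baseline, since a probe suggests leakage only if it ranks a patient's true conditions higher than a scorer that ignores the name entirely.

```python
from typing import Callable, Dict, List, Set

# Hypothetical prompt template; not taken from the paper.
TEMPLATE = "{name} is a patient with [MASK]."

# A scorer maps (prompt, candidate condition) to a score. In a real probe this
# would wrap a masked language model; here it is only an interface.
Scorer = Callable[[str, str], float]


def rank_conditions(name: str, candidates: List[str], score: Scorer) -> List[str]:
    """Return candidate conditions sorted from highest to lowest score."""
    prompt = TEMPLATE.format(name=name)
    return sorted(candidates, key=lambda c: score(prompt, c), reverse=True)


def precision_at_k(
    records: Dict[str, Set[str]],
    candidates: List[str],
    score: Scorer,
    k: int,
) -> float:
    """Fraction of top-k predictions that are true conditions, over all patients."""
    hits = 0
    for name, true_conditions in records.items():
        top_k = rank_conditions(name, candidates, score)[:k]
        hits += sum(1 for c in top_k if c in true_conditions)
    return hits / (k * len(records))


# Toy, synthetic data for illustration only.
CANDIDATES = ["hypertension", "diabetes", "asthma"]
TOY_FREQ = {"hypertension": 0.5, "diabetes": 0.3, "asthma": 0.2}
RECORDS = {"Jane Doe": {"asthma"}, "John Roe": {"hypertension"}}


def frequency_scorer(prompt: str, candidate: str) -> float:
    """Name-independent baseline: ignores the prompt and scores by frequency."""
    return TOY_FREQ[candidate]


if __name__ == "__main__":
    baseline = precision_at_k(RECORDS, CANDIDATES, frequency_scorer, k=1)
    print(f"frequency baseline P@1 = {baseline:.2f}")
```

To probe a real model under these assumptions, one would replace `frequency_scorer` with a function that queries the model and then compare its precision against the baseline on the same records.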

Taxonomy

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dropout · Adam · Dense Connections · Softmax · Linear Warmup With Linear Decay · Attention Dropout