SHIELD: A Diverse Clinical Note Dataset and Distilled Small Language Models for Enterprise-Scale De-identification

Jose D. Posada; David Love; Somalee Datta; Priya Desai

arXiv:2605.03301 · cs.CL · May 6, 2026

TL;DR

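This paper introduces SHIELD, a diverse clinical note dataset, and develops distilled small language models for enterprise-scale de-identification of electronic health records. It targets two gaps: public benchmarks that are over a decade old and lack the diversity of modern clinical narratives, and LLMs whose compute costs and cloud-API governance restrictions hinder enterprise deployment.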

Contribution

The paper presents a new diverse dataset for clinical de-identification and demonstrates how to distill large language models into efficient small models suitable for enterprise deployment.

Findings

01. SHIELD dataset contains 1,394 notes with 10,505 PHI spans across 9 categories.

02. Distilled models achieve 0.88 precision and 0.86 recall on PHI span detection.

03. Distilled models generalize well across datasets but struggle with institution-specific entities.
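
For readers unfamiliar with span-level scoring, the precision and recall figures in the findings can be read against a minimal sketch like the one below. It assumes exact-match scoring, where a predicted span counts only if its start offset, end offset, and category all agree with a gold span. This summary does not state the paper's matching criterion, so exact match is an assumption here, and the `Span` type, the function name, and the category labels are hypothetical.

```python
from typing import Iterable, NamedTuple, Tuple


class Span(NamedTuple):
    """A PHI span as character offsets plus a category label (hypothetical schema)."""
    start: int
    end: int
    category: str


def span_precision_recall(gold: Iterable[Span], predicted: Iterable[Span]) -> Tuple[float, float]:
    """Exact-match span scoring (an assumed criterion, not necessarily the paper's).

    A prediction is a true positive only if start, end, and category
    all match a gold span.
    """
    gold_set = set(gold)
    pred_set = set(predicted)
    true_positives = len(gold_set & pred_set)
    # Guard against empty sets so the ratios are always defined.
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    return precision, recall


# Illustrative data only; the labels below are placeholders.
gold = [Span(0, 8, "NAME"), Span(20, 30, "DATE"), Span(41, 52, "PHONE")]
pred = [
    Span(0, 8, "NAME"),     # exact match
    Span(20, 30, "DATE"),   # exact match
    Span(41, 50, "PHONE"),  # boundary error: counts as a miss under exact match
    Span(60, 65, "ID"),     # spurious prediction
]
precision, recall = span_precision_recall(gold, pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Under exact matching, a boundary error is penalized twice: it lowers precision as a wrong prediction and lowers recall as a missed gold span. Relaxed (overlap-based) matching would score the same predictions more leniently.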

Abstract

De-identification of clinical text remains essential for secondary use of electronic health records (EHRs), yet public benchmarks such as i2b2 2006/2014 are over a decade old and lack the semantic and demographic diversity of modern narratives. While Large Language Models (LLMs) achieve state-of-the-art zero-shot extraction, enterprise deployment is hindered by compute costs and governance restricting Protected Health Information (PHI) from cloud APIs. We introduce SHIELD (Synthetic Human-annotated Identifier-replaced Entries for Learning and De-identification), a diverse dataset of 1,394 notes with 10,505 gold-standard PHI spans across 9 categories, built via set-cover diversity sampling with human-in-the-loop adjudication. We evaluate four LLMs (two proprietary, two open-weight) to establish a performance ceiling, then distill these capabilities into locally deployable Small Language…
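
The abstract names set-cover diversity sampling but this summary does not describe the procedure. As a rough illustration of the general idea only, the sketch below uses the standard greedy set-cover heuristic: repeatedly pick the note that covers the most not-yet-covered features. The feature tags, the `budget` parameter, and the function name are all hypothetical, and the authors' actual sampling method may differ.

```python
from typing import Dict, List, Set


def greedy_set_cover_sample(note_features: Dict[str, Set[str]], budget: int) -> List[str]:
    """Greedy set-cover heuristic (an illustrative assumption, not the paper's code).

    Selects up to `budget` notes, each time choosing the note that adds
    the most features not yet covered by the selection.
    """
    uncovered: Set[str] = set().union(*note_features.values())
    remaining = dict(note_features)
    selected: List[str] = []
    while uncovered and remaining and len(selected) < budget:
        # Iterate in sorted order so ties resolve deterministically.
        best_id = max(sorted(remaining), key=lambda nid: len(remaining[nid] & uncovered))
        gain = remaining[best_id] & uncovered
        if not gain:
            break  # no remaining note adds new coverage
        selected.append(best_id)
        uncovered -= gain
        del remaining[best_id]
    return selected


# Hypothetical feature tags describing each note.
notes = {
    "n1": {"type:discharge", "dept:cardiology"},
    "n2": {"type:discharge", "dept:oncology", "lang:es"},
    "n3": {"type:progress", "dept:cardiology"},
    "n4": {"type:progress"},
}
selected = greedy_set_cover_sample(notes, budget=3)
print(selected)  # ['n2', 'n3']
```

In this toy example two notes already cover all five features, so the sampler stops before exhausting the budget. The greedy heuristic is a common approximation because finding a minimum set cover exactly is NP-hard.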
