# ASQ-PHI: An adversarial synthetic data benchmark for clinical de-identification and search utility

**Authors:** James Weatherhead, George Golovko, Peter McCaffrey

PMC · DOI: 10.1016/j.dib.2026.112586 · Data in Brief · 2026-02-11

## TL;DR

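ASQ-PHI is a fully synthetic benchmark of 1051 clinician-style search queries annotated with HIPAA Safe Harbor identifiers, built to evaluate de-identification at the point where a query leaves a BAA-covered large language model for an external tool such as web search.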

## Contribution

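The paper introduces ASQ-PHI, a fully synthetic benchmark for de-identifying short, search-style clinical queries, a setting not covered by existing benchmarks built from long electronic health record narratives.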

## Key findings

- ASQ-PHI contains 1051 synthetic clinical search queries with PHI annotations for HIPAA Safe Harbor compliance.
- The dataset includes 2973 PHI elements across 13 identifier types, enabling evaluation of PHI removal and over-redaction.
- The dataset was generated with an adversarial few-shot prompting pipeline using Azure OpenAI GPT-4o.
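
The two evaluation targets named above, PHI removal and over-redaction, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the metric definitions (exact-string removal recall, and the share of PHI-free queries that a de-identifier alters), the `toy_redactor` stand-in, and the two example queries, which are invented here and are not taken from the dataset. The paper's own baseline metrics may be defined differently.

```python
import re


def toy_redactor(query: str) -> str:
    """Hypothetical stand-in de-identifier: masks only runs of 6+ digits."""
    return re.sub(r"\b\d{6,}\b", "[REDACTED]", query)


def evaluate(records, redact):
    """Score a de-identifier on (query, phi_values) pairs.

    An empty phi_values list marks a hard negative (PHI-free query).
    Both metric definitions are assumptions, not the paper's.
    """
    total_phi = removed_phi = 0
    negatives = altered_negatives = 0
    for query, phi_values in records:
        output = redact(query)
        if phi_values:
            total_phi += len(phi_values)
            # A PHI value counts as removed if its exact string is gone.
            removed_phi += sum(value not in output for value in phi_values)
        else:
            negatives += 1
            # Any change to a PHI-free query counts as over-redaction.
            altered_negatives += output != query
    return {
        "phi_removal_recall": removed_phi / total_phi if total_phi else 0.0,
        "over_redaction_rate": altered_negatives / negatives if negatives else 0.0,
    }


# Invented toy records: one PHI-positive query and one hard negative.
records = [
    ("Patient John Doe MRN 12345678 with new AFib, anticoagulation options?",
     ["John Doe", "12345678"]),
    ("67 year old on metformin 1000 mg with eGFR 28, dose adjustment?", []),
]
scores = evaluate(records, toy_redactor)
print(scores)
```

The hard negative matters here because a de-identifier that masks every number would score well on removal while destroying the age, dose, and lab value that make the query useful for search.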

## Abstract

Hospitals and vendors now run HIPAA-compliant Business Associate Agreement (BAA) large language models (LLMs) for clinical work. These systems do not use input data for further training, so clinicians can enter Protected Health Information (PHI) into them. Because LLMs are trained on a fixed corpus with a historical cutoff, their answers often need to be supplemented with more recent clinical evidence from external sources, such as live web search or other tools, that are often not covered by a BAA. This creates a “safe handoff” point where a clinician’s PHI-containing query must be transformed into a HIPAA Safe Harbor compliant version before leaving the protected environment. However, publicly shareable datasets for this setting are scarce; this article describes PHI-rich clinician-style questions paired with HIPAA Safe Harbor annotations at the point where an external tool is called. Existing de-identification benchmarks are typically built from long electronic health record narratives such as discharge summaries and clinic notes, rather than from short, compressed search-style queries such as those that might be used in chat-based clinical LLM interfaces. ASQ-PHI (Adversarial Synthetic Queries for Protected Health Information de-identification) is a fully synthetic benchmark dataset designed for this safe handoff setting; no real patient data, electronic health records, or protected health information were accessed, used, or referenced during dataset creation. It contains 1051 single-turn clinical search queries designed to resemble prompts that clinicians might enter into HIPAA-compliant LLMs. Each record uses machine-parsable delimiters to separate the free-text query from PHI annotations, which are provided as one JSON object per element specifying the HIPAA Safe Harbor identifier category and exact string value.
The corpus includes 832 PHI-positive queries (79.2%) and 219 hard negatives (20.8%) engineered to mimic PHI-like syntax while containing only non-identifying clinical information such as ages under 90 years, diagnoses, medications, and symptoms. Across the dataset, there are 2973 PHI elements labeled from 13 textual HIPAA Safe Harbor identifier types that can be represented as short alphanumeric strings in single-line clinical questions, supporting the measurement of both PHI removal and over-redaction on PHI-free queries. All queries were generated with an adversarial few-shot prompting pipeline using Azure OpenAI GPT-4o. The associated Mendeley Data repository provides the complete dataset file, a Jupyter notebook that implements the generation pipeline, summary statistics, baseline metrics for a commercial PHI detection service, and six figures that describe the dataset. ASQ-PHI is released under an MIT license.
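
The record layout described in the abstract (a free-text query, a delimiter, then one JSON object per PHI element) could be read along the following lines. This is only a sketch: the abstract does not state the actual delimiter strings or JSON key names, so `QUERY_MARKER`, `TAGS_MARKER`, `identifier_type`, and `value` are hypothetical placeholders, and the sample record is invented for illustration. The real markers and keys should be taken from the dataset file in the Mendeley Data repository.

```python
import json

# Hypothetical delimiters: the dataset's real markers are not given in the abstract.
QUERY_MARKER = "===QUERY==="
TAGS_MARKER = "===PHI_TAGS==="


def parse_record(block: str):
    """Split one record into its query text and a list of PHI annotation dicts."""
    _, _, rest = block.partition(QUERY_MARKER)
    query_part, _, tags_part = rest.partition(TAGS_MARKER)
    query = query_part.strip()
    # One JSON object per non-empty line; hard negatives yield an empty list.
    tags = [json.loads(line) for line in tags_part.splitlines() if line.strip()]
    return query, tags


def redact_with_gold(query: str, tags) -> str:
    """Replace each annotated exact string with its category placeholder."""
    for tag in tags:
        # "value" and "identifier_type" are assumed key names.
        query = query.replace(tag["value"], f"[{tag['identifier_type']}]")
    return query


# Invented sample record, not taken from the dataset.
sample = """===QUERY===
Jane Roe, MRN 00482917, chest pain after stent placement, next steps?
===PHI_TAGS===
{"identifier_type": "NAME", "value": "Jane Roe"}
{"identifier_type": "MEDICAL_RECORD_NUMBER", "value": "00482917"}
"""

query, tags = parse_record(sample)
safe_query = redact_with_gold(query, tags)
print(safe_query)
```

Because the annotations carry exact string values rather than character offsets, a gold-standard Safe Harbor version of each query can be produced by plain string replacement, which gives a reference output to compare a de-identifier against.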

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12926592/full.md

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12926592/full.md

## References

13 references — full list in the complete paper: https://tomesphere.com/paper/PMC12926592/full.md

---
Source: https://tomesphere.com/paper/PMC12926592