Synthetic Data for Veterinary EHR De-identification: Benefits, Limits, and Safety Trade-offs Under Fixed Compute
David Brundage

TL;DR
This study evaluates whether synthetic veterinary health records generated by large language models improve de-identification safety, and finds that synthetic data can augment real training data but cannot safely replace it.
Contribution
The paper shows that synthetic data improves de-identification performance when added to real training data, but cannot safely replace real data when the number of training samples is held fixed.
Findings
Synthetic augmentation improves de-identification performance.
High synthetic dominance degrades utility and safety.
Mismatches between synthetic and real notes contribute to identifier leakage.
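The difference between augmentation and fixed-sample substitution can be made concrete with a small sketch. This is an illustration only, not the paper's pipeline: the function names `augment` and `substitute`, the `synthetic_fraction` parameter, and uniform random sampling are all assumptions introduced here.

```python
import random
from typing import List, Sequence


def augment(real: Sequence[str], synthetic: Sequence[str],
            n_synthetic: int, seed: int = 0) -> List[str]:
    """Augmentation (assumed form): keep every real note and add synthetic ones.

    The training set grows beyond the size of the real corpus.
    """
    rng = random.Random(seed)
    return list(real) + rng.sample(list(synthetic), n_synthetic)


def substitute(real: Sequence[str], synthetic: Sequence[str],
               synthetic_fraction: float, seed: int = 0) -> List[str]:
    """Substitution (assumed form): hold the total sample count fixed at
    len(real) and swap a fraction of real notes for synthetic ones.
    """
    if not 0.0 <= synthetic_fraction <= 1.0:
        raise ValueError("synthetic_fraction must be in [0, 1]")
    rng = random.Random(seed)
    budget = len(real)                       # fixed number of training samples
    n_syn = round(budget * synthetic_fraction)
    n_real = budget - n_syn
    return rng.sample(list(real), n_real) + rng.sample(list(synthetic), n_syn)


if __name__ == "__main__":
    real_notes = [f"real_{i}" for i in range(10)]      # placeholder notes
    synthetic_notes = [f"syn_{i}" for i in range(30)]  # placeholder notes
    print(len(augment(real_notes, synthetic_notes, 5)))
    print(len(substitute(real_notes, synthetic_notes, 0.7)))
```

Under substitution, raising `synthetic_fraction` removes real notes from training, which is the "high synthetic dominance" setting the findings describe as harmful.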
Abstract
Veterinary electronic health records (vEHRs) contain privacy-sensitive identifiers that limit secondary use. While PetEVAL provides a benchmark for veterinary de-identification, the domain remains low-resource. This study evaluates whether large language model (LLM)-generated synthetic narratives improve de-identification safety under distinct training regimes, emphasizing (i) synthetic augmentation and (ii) fixed-budget substitution. We conducted a controlled simulation using a PetEVAL-derived corpus (3,750 holdout/1,249 train). We generated 10,382 synthetic notes using a privacy-preserving "template-only" regime where identifiers were removed prior to LLM prompting. Three transformer backbones (PetBERT, VetBERT, Bio_ClinicalBERT) were trained under varying mixtures. Evaluation prioritized document-level leakage rate (the fraction of documents with at least one missed identifier) as…
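The document-level leakage rate defined in the abstract (the fraction of documents with at least one missed identifier) can be sketched as below. The span representation as character offsets and the matching rule (a gold identifier counts as caught only if some predicted span fully covers it) are assumptions made for illustration; the paper may use a different matching criterion.

```python
from typing import List, Set, Tuple

# Hypothetical representation: an identifier is a (start, end) character span.
Span = Tuple[int, int]


def document_leaks(gold: Set[Span], predicted: Set[Span]) -> bool:
    """Return True if at least one gold identifier is missed.

    Assumed rule: a gold span is caught only when a predicted span
    fully covers it.
    """
    for g_start, g_end in gold:
        covered = any(p_start <= g_start and g_end <= p_end
                      for p_start, p_end in predicted)
        if not covered:
            return True
    return False


def leakage_rate(gold_docs: List[Set[Span]],
                 pred_docs: List[Set[Span]]) -> float:
    """Fraction of documents with at least one missed identifier."""
    if len(gold_docs) != len(pred_docs):
        raise ValueError("gold and predicted document lists differ in length")
    if not gold_docs:
        return 0.0
    leaked = sum(document_leaks(g, p) for g, p in zip(gold_docs, pred_docs))
    return leaked / len(gold_docs)


if __name__ == "__main__":
    gold = [{(0, 5)}, {(3, 8), (10, 14)}, set(), {(2, 4)}]
    pred = [{(0, 5)}, {(3, 8)}, set(), {(0, 10)}]
    # Only the second document misses an identifier, the span (10, 14).
    print(leakage_rate(gold, pred))
```

Unlike token-level recall, this metric counts a document as leaked no matter how many of its other identifiers were caught, so a single miss is enough to mark the whole record unsafe.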
Taxonomy
Topics: Electronic Health Records Systems · Food Supply Chain Traceability · Data-Driven Disease Surveillance
