How to Train Private Clinical Language Models: A Comparative Study of Privacy-Preserving Pipelines for ICD-9 Coding
Mathieu Dufour, Andrew Duncan

TL;DR
This study systematically compares four privacy-preserving training pipelines for clinical language models, finding knowledge distillation offers the best privacy-utility balance for ICD-9 coding tasks.
Contribution
The paper provides the first systematic head-to-head comparison of privacy-preserving training pipelines for clinical NLP, and identifies knowledge distillation from DP-trained teachers as the most effective approach at moderate and relaxed privacy budgets.
Findings
Knowledge distillation from DP-trained teachers outperforms both direct DP-SGD and DP-synthetic data training at moderate and relaxed privacy budgets.
Knowledge distillation recovers up to 63% of non-private performance.
Knowledge distillation maintains strong empirical privacy, with membership inference AUC around 0.5 (chance level).
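To make the membership inference figure concrete, here is a minimal sketch of how an attack AUC can be computed from per-example attack scores. The function name, the toy scores, and the choice of a rank-based (Mann-Whitney) estimate are illustrative assumptions; the page does not describe the attack or the scoring used in the paper.

```python
def membership_auc(member_scores, nonmember_scores):
    """Rank-based AUC: the probability that a randomly chosen training
    member receives a higher attack score than a randomly chosen
    non-member, with ties counted as half a win."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


# Hypothetical attack scores (e.g. negative loss per example).
# Identical score distributions: the attacker cannot tell members apart.
auc_private = membership_auc([1, 2, 3], [1, 2, 3])
# Perfectly separated scores: the attacker always identifies members.
auc_leaky = membership_auc([3, 4], [1, 2])
```

An AUC near 0.5 means the attack does no better than guessing, while values approaching 1.0 indicate that training membership leaks through the model's outputs.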
Abstract
Large language models trained on clinical text risk exposing sensitive patient information, yet differential privacy (DP) methods often severely degrade the diagnostic accuracy needed for deployment. Despite rapid progress in DP optimisation and text generation, it remains unclear which privacy-preserving strategy actually works best for clinical language tasks. We present the first systematic head-to-head comparison of four training pipelines for automated diagnostic coding from hospital discharge summaries. All pipelines use identical 1B-parameter models and matched privacy budgets to predict ICD-9 codes. At moderate and relaxed privacy budgets, knowledge distillation from DP-trained teachers outperforms both direct DP-SGD and DP-synthetic data training, recovering up to 63% of the non-private performance whilst maintaining strong empirical privacy…
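For readers unfamiliar with DP-SGD, which the abstract names as the direct baseline, the following is a minimal sketch of one update step in its standard textbook form: clip each example's gradient, sum, add Gaussian noise, then average. It uses only the Python standard library; the function name, parameters, and toy gradients are assumptions for illustration and do not reflect the paper's actual training configuration.

```python
import math
import random


def dp_sgd_step(weights, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One DP-SGD update on a flat list of weights (illustrative sketch)."""
    summed = [0.0] * len(weights)
    for grad in per_example_grads:
        # Clip each example's gradient to L2 norm at most clip_norm.
        norm = math.sqrt(sum(x * x for x in grad))
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, x in enumerate(grad):
            summed[i] += x * scale
    # Gaussian noise with standard deviation proportional to the clipping bound.
    sigma = noise_multiplier * clip_norm
    batch_size = len(per_example_grads)
    noisy_mean = [(s + rng.gauss(0.0, sigma)) / batch_size for s in summed]
    return [w - lr * g for w, g in zip(weights, noisy_mean)]


# Toy usage with hypothetical gradients; noise_multiplier=0 isolates the clipping step.
rng = random.Random(0)
new_weights = dp_sgd_step(
    weights=[0.0, 0.0],
    per_example_grads=[[3.0, 4.0], [0.3, 0.4]],
    lr=1.0,
    clip_norm=1.0,
    noise_multiplier=0.0,
    rng=rng,
)
```

In this toy call the first gradient (norm 5) is scaled down to norm 1 and the second (norm 0.5) passes through unchanged. A larger noise multiplier generally corresponds to a tighter privacy budget at the cost of noisier updates, which is the utility loss the abstract refers to.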
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Machine Learning in Healthcare · Artificial Intelligence in Healthcare and Education
