Towards Fair and Efficient De-identification: Quantifying the Efficiency and Generalizability of De-identification Approaches
Noopur Zambare, Kiana Aghakasiri, Carissa Lin, Carrie Ye, J. Ross Mitchell, Mohamed Abdalla

TL;DR
This paper evaluates fine-tuned transformer models and small and large language models on clinical de-identification. It shows that smaller models can be both efficient and effective across multiple languages and cultural contexts, and it introduces a new multi-cultural de-identification model, BERT-MultiCulture-DEID.
Contribution
The paper systematically compares de-identification models of different sizes across languages and cultural contexts, and introduces BERT-MultiCulture-DEID to improve robustness across cultures and languages.
Findings
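Smaller models achieve performance comparable to larger models at substantially lower inference cost.
Fine-tuned on limited data, smaller models outperform larger ones at de-identifying identifiers drawn from Mandarin, Hindi, Spanish, French, Bengali, and regional variations of English, as well as gendered names.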
The new BERT-MultiCulture-DEID improves robustness across diverse languages and cultural contexts.
Abstract
Large language models (LLMs) have shown strong performance on clinical de-identification, the task of identifying sensitive identifiers to protect privacy. However, previous work has not examined their generalizability between formats, cultures, and genders. In this work, we systematically evaluate fine-tuned transformer models (BERT, ClinicalBERT, ModernBERT), small LLMs (Llama 1-8B, Qwen 1.5-7B), and large LLMs (Llama-70B, Qwen-72B) at de-identification. We show that smaller models achieve comparable performance while substantially reducing inference cost, making them more practical for deployment. Moreover, we demonstrate that smaller models can be fine-tuned with limited data to outperform larger models in de-identifying identifiers drawn from Mandarin, Hindi, Spanish, French, Bengali, and regional variations of English, in addition to gendered names. To improve robustness in…
Taxonomy
Topics: Machine Learning in Healthcare · Privacy-Preserving Technologies in Data · Electronic Health Records Systems
