LLMs-in-the-Loop Part 2: Expert Small AI Models for Anonymization and De-identification of PHI Across Multiple Languages
Murat Gunay, Bunyamin Keles, Raife Hizlan

TL;DR
This paper presents expert small AI models for multilingual PHI de-identification that outperform large language models, ensuring privacy and reliability in healthcare data processing with high accuracy across eight languages.
Contribution
It introduces domain-specific small NER models built with the LLM-in-the-loop methodology, reporting better de-identification performance than existing large models along with privacy advantages over them.
Findings
Achieved high F1 scores (around 0.95-0.98) across eight languages.
Outperformed GPT-4, as well as other small models, in de-identification tasks.
Demonstrated cost-effective, privacy-preserving healthcare data anonymization.
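To make the findings above concrete, the following minimal sketch shows how the output of a de-identification NER model might be used to mask PHI, and how an entity-level F1 score can be computed. It is not the authors' implementation: the span format `(start, end, label)`, the label names, and the function names `mask_phi` and `f1_score` are assumptions for illustration, and exact-match scoring is only one of several common evaluation conventions.

```python
from typing import List, Tuple

# Hypothetical span format: (start, end, label) as character offsets, end exclusive.
Span = Tuple[int, int, str]


def mask_phi(text: str, spans: List[Span]) -> str:
    """Replace each predicted PHI span with a bracketed label placeholder."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            # Skip spans that overlap one already masked.
            continue
        pieces.append(text[cursor:start])
        pieces.append(f"[{label}]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


def f1_score(gold: List[Span], pred: List[Span]) -> float:
    """Entity-level F1, counting a prediction as correct only on exact span and label match."""
    gold_set, pred_set = set(gold), set(pred)
    true_positives = len(gold_set & pred_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)


note = "John Smith was admitted on 2021-03-04."
gold = [(0, 10, "NAME"), (27, 37, "DATE")]
masked = mask_phi(note, gold)
```

Here `masked` holds the note with the name and date replaced by `[NAME]` and `[DATE]`. Because the masking runs locally on model output, no raw text has to leave the machine, which is the privacy argument the paper makes for small models over API-based LLMs.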
Abstract
The rise of chronic diseases and pandemics like COVID-19 has emphasized the need for effective patient data processing while ensuring privacy through anonymization and de-identification of protected health information (PHI). Anonymized data facilitates research without compromising patient confidentiality. This paper introduces expert small AI models developed using the LLM-in-the-loop methodology to meet the demand for domain-specific de-identification NER models. These models overcome the privacy risks associated with large language models (LLMs) used via APIs by eliminating the need to transmit or store sensitive data. More importantly, they consistently outperform LLMs in de-identification tasks, offering superior performance and reliability. Our de-identification NER models, developed in eight languages (English, German, Italian, French, Romanian, Turkish, Spanish, and Arabic)…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
