Generation and De-Identification of Indian Clinical Discharge Summaries using LLMs
Sanjeet Singh, Shreya Gupta, Niralee Gupta, Naimish Sharma, Lokesh Srivastava, Vibhu Agarwal, and Ashutosh Modi

TL;DR
This paper investigates the challenges of de-identifying Indian clinical discharge summaries using language models, highlighting issues of generalization and proposing synthetic data generation with LLMs to improve de-identification performance.
Contribution
The paper proposes using LLM-generated synthetic reports to improve de-identification systems for Indian healthcare data, addressing both data scarcity and poor cross-institutional generalization.
Findings
De-identification algorithms trained on non-Indian data perform poorly on Indian datasets.
Using off-the-shelf de-identification tools without adapting them to local data is risky.
Synthetic reports generated by LLMs improve de-identification performance and generalization.
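To make the findings above concrete, here is a minimal sketch of how de-identification performance could be scored by comparing predicted personal-information spans against gold annotations. This is an illustration, not the paper's evaluation protocol: the `Span` layout, the `span_scores` helper, the label names, and the toy spans are all hypothetical, and exact-match span scoring is only one of several possible criteria.

```python
from typing import Dict, Set, Tuple

# Hypothetical span representation: (start offset, end offset, label).
Span = Tuple[int, int, str]


def span_scores(gold: Set[Span], pred: Set[Span]) -> Dict[str, float]:
    """Exact-match precision, recall, and F1 over annotated spans."""
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy example (made-up spans, not data from the paper).
gold = {(0, 12, "NAME"), (20, 30, "DATE"), (40, 52, "HOSPITAL")}
pred = {(0, 12, "NAME"), (20, 30, "DATE"), (60, 65, "ID")}
scores = span_scores(gold, pred)
print(scores)
```

For de-identification, recall is usually the figure to watch, since every missed span is personal information left in the text. Scoring the same model on summaries from a different institution is one simple way to expose the generalization gap described above.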
Abstract
The consequences of a healthcare data breach can be devastating for the patients, providers, and payers. The average financial impact of a data breach in recent months has been estimated to be close to USD 10 million. This is especially significant for healthcare organizations in India that are managing rapid digitization while still establishing data governance procedures that align with the letter and spirit of the law. Computer-based systems for de-identification of personal information are vulnerable to data drift, often rendering them ineffective in cross-institution settings. Therefore, a rigorous assessment of existing de-identification against local health datasets is imperative to support the safe adoption of digital health initiatives in India. Using a small set of de-identified patient discharge summaries provided by an Indian healthcare institution, in this paper, we report…
Taxonomy
Topics: Natural Language Processing Techniques · Biomedical Text Mining and Ontologies · Topic Modeling
Methods: Sparse Evolutionary Training · ALIGN
