Generation and De-Identification of Indian Clinical Discharge Summaries using LLMs

Sanjeet Singh, Shreya Gupta, Niralee Gupta, Naimish Sharma, Lokesh Srivastava, Vibhu Agarwal, Ashutosh Modi

arXiv:2407.05887·cs.CL·July 9, 2024

TL;DR

This paper investigates the challenges of de-identifying Indian clinical discharge summaries using language models, highlighting issues of generalization and proposing synthetic data generation with LLMs to improve de-identification performance.

Contribution

The paper introduces the use of LLM-generated synthetic reports to strengthen de-identification systems for Indian healthcare data, addressing data scarcity and cross-institutional generalization.

Findings

01. De-identification algorithms trained on non-Indian data perform poorly on Indian datasets.

02. Off-the-shelf de-identification tools pose risks due to lack of adaptation.

03. Synthetic reports generated by LLMs improve de-identification performance and generalization.
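Claims such as "perform poorly" are usually backed by span-level precision, recall, and F1 over annotated personal-information spans. This page does not state the paper's exact evaluation protocol, so the following is only a minimal sketch under assumptions: the `Span` type, the `span_prf` helper, the exact-match criterion, and the toy spans are all hypothetical and chosen for illustration.

```python
from __future__ import annotations

from typing import NamedTuple


class Span(NamedTuple):
    """A hypothetical annotated span: character offsets plus an entity label."""
    start: int
    end: int
    label: str


def span_prf(gold: set[Span], pred: set[Span]) -> tuple[float, float, float]:
    """Exact-match span scoring: a prediction counts only if the offsets
    and the label both agree with a gold annotation."""
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example (invented data): three gold spans, three predicted spans,
# two of which match exactly.
gold = {Span(0, 12, "NAME"), Span(30, 40, "DATE"), Span(55, 65, "PHONE")}
pred = {Span(0, 12, "NAME"), Span(30, 40, "DATE"), Span(70, 80, "NAME")}

precision, recall, f1 = span_prf(gold, pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

For de-identification, recall is typically the figure to watch, since every missed span is personal information left in the released text. Exact matching is a strict choice; token-level or partial-overlap scoring would be a reasonable alternative.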

Abstract

The consequences of a healthcare data breach can be devastating for the patients, providers, and payers. The average financial impact of a data breach in recent months has been estimated to be close to USD 10 million. This is especially significant for healthcare organizations in India that are managing rapid digitization while still establishing data governance procedures that align with the letter and spirit of the law. Computer-based systems for de-identification of personal information are vulnerable to data drift, often rendering them ineffective in cross-institution settings. Therefore, a rigorous assessment of existing de-identification against local health datasets is imperative to support the safe adoption of digital health initiatives in India. Using a small set of de-identified patient discharge summaries provided by an Indian healthcare institution, in this paper, we report…

Code & Models

Repositories

exploration-lab/llm-for-clinical-report-generation-deidentification (Official)

Taxonomy

Topics: Natural Language Processing Techniques · Biomedical Text Mining and Ontologies · Topic Modeling

Methods: Sparse Evolutionary Training · ALIGN