Natural Language Generation for Electronic Health Records
Scott Lee

TL;DR
This paper introduces an encoder-decoder deep learning model for generating realistic, de-identified unstructured text in electronic health records, enhancing data sharing and privacy preservation.
Contribution
The study demonstrates a novel application of encoder-decoder models to generate synthetic unstructured EHR text, addressing limitations of existing methods.
Findings
Generated chief complaints preserve epidemiological information.
Synthetic text is free of PII and of relatively uncommon abbreviations and misspellings.
Potential to support de-identification and synthetic EHR generation.
Abstract
A variety of methods exist for generating synthetic electronic health records (EHRs), but they are not capable of generating unstructured text, such as emergency department (ED) chief complaints, history of present illness, or progress notes. Here, we use the encoder-decoder model, a deep learning algorithm that features in many contemporary machine translation systems, to generate synthetic chief complaints from discrete variables in EHRs, such as age group, gender, and discharge diagnosis. After being trained end-to-end on authentic records, the model can generate realistic chief complaint text that preserves much of the epidemiological information in the original data. As a side effect of the model's optimization goal, these synthetic chief complaints are also free of relatively uncommon abbreviations and misspellings, and they include none of the personally-identifiable information (PII)…
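The abstract says the model is conditioned on discrete EHR variables (age group, gender, discharge diagnosis). As a minimal sketch of what such an input might look like before it reaches an encoder, the example below one-hot encodes each field and concatenates the results. The category lists and the function names `one_hot` and `encode_record` are hypothetical and chosen only for illustration. The abstract does not give the paper's actual vocabularies or feature encoding, and the encoder-decoder network itself is not shown.

```python
# Hypothetical category sets; the paper's real vocabularies are not given here.
AGE_GROUPS = ["0-17", "18-44", "45-64", "65+"]
GENDERS = ["F", "M", "U"]
DIAGNOSES = ["asthma", "fracture", "chest pain"]


def one_hot(value, vocabulary):
    """Return a list of 0s with a single 1 at the position of `value`."""
    vec = [0] * len(vocabulary)
    vec[vocabulary.index(value)] = 1  # raises ValueError for unknown values
    return vec


def encode_record(age_group, gender, diagnosis):
    """Concatenate one-hot vectors for the three discrete fields."""
    return (
        one_hot(age_group, AGE_GROUPS)
        + one_hot(gender, GENDERS)
        + one_hot(diagnosis, DIAGNOSES)
    )


# Example record: an adult female patient discharged with asthma.
features = encode_record("18-44", "F", "asthma")
print(features)
```

In a full system, a vector like `features` would be the encoder input, and the decoder would emit the chief complaint one token at a time.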
Taxonomy
Topics: Topic Modeling · Machine Learning in Healthcare · Biomedical Text Mining and Ontologies
