Deep learning-based NLP Data Pipeline for EHR Scanned Document Information Extraction
Enshuo Hsu (1, 3, and 4), Ioannis Malagaris (1), Yong-Fang Kuo (1), Rizwana Sultana (2), Kirk Roberts (3) ((1) Office of Biostatistics, (2) Division of Pulmonary, Critical Care, Sleep Medicine, Department of Internal Medicine, University of Texas Medical Branch, Galveston

TL;DR
This study evaluates a deep learning NLP pipeline for extracting sleep apnea indicators from scanned EHR documents, emphasizing the importance of image preprocessing and document layout in improving extraction accuracy.
Contribution
It systematically assesses the impact of image preprocessing, NLP models, and document layout features on extracting clinical indicators from scanned health records.
Findings
Clinical BERT achieved an AUROC of 0.9743 for AHI
Proper image preprocessing improves extraction accuracy
Document layout information enhances model performance
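For readers unfamiliar with the metric in the first finding, AUROC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it by pairwise comparison. The function name `auroc` and the toy labels and scores are illustrative assumptions; they are not the paper's data or evaluation code, and this summary does not say how the paper framed its classification task.

```python
def auroc(labels, scores):
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve.

    labels: iterable of 0/1 ground-truth values (hypothetical toy input)
    scores: iterable of model scores, higher meaning "more likely positive"
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy data, not from the paper.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = auroc(toy_labels, toy_scores)
```

A value of 1.0 means every positive example outscores every negative one, and 0.5 corresponds to random ranking.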
Abstract
Scanned documents in electronic health records (EHR) have been a challenge for decades and are expected to remain one for the foreseeable future. Current approaches to processing them often include image preprocessing, optical character recognition (OCR), and text mining. However, limited work has evaluated the choice of image preprocessing methods, the selection of NLP models, and the role of document layout, so the impact of each element remains unknown. We evaluated this approach on a use case of two key indicators for sleep apnea, apnea-hypopnea index (AHI) and oxygen saturation (SaO2) values, extracted from scanned sleep study reports. Our data, which included 955 manually annotated reports, was reused from a previous study at the University of Texas Medical Branch. We performed image preprocessing: gray-scaling followed by 1 iteration of dilation and erosion, and 20% contrast…
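The first preprocessing steps named in the abstract (gray-scaling, then one iteration of dilation and one of erosion) can be sketched in plain Python as below. This is a minimal illustration, not the authors' implementation: the 3x3 neighbourhood, the luminance weights, and all function names are assumptions, and the contrast step is left out because the abstract is cut off before describing it. In practice an image library would normally handle these operations.

```python
def to_gray(rgb_image):
    """Convert rows of (r, g, b) tuples to rows of 0-255 gray values.

    The 0.299/0.587/0.114 luminance weights are a common convention,
    assumed here; the abstract does not specify them.
    """
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in rgb_image
    ]


def _morph(gray, pick):
    """Apply `pick` (max or min) over each pixel's 3x3 neighbourhood.

    The 3x3 window is an assumption; windows are clipped at the border.
    """
    h, w = len(gray), len(gray[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [
                gray[j][i]
                for j in range(max(0, y - 1), min(h, y + 2))
                for i in range(max(0, x - 1), min(w, x + 2))
            ]
            row.append(pick(window))
        out.append(row)
    return out


def dilate(gray):
    """Grow bright regions: each pixel becomes the max of its neighbourhood."""
    return _morph(gray, max)


def erode(gray):
    """Grow dark regions: each pixel becomes the min of its neighbourhood."""
    return _morph(gray, min)


def preprocess(rgb_image):
    """Gray-scale, then one dilation followed by one erosion."""
    return erode(dilate(to_gray(rgb_image)))


# Toy 5x5 white page with a single dark speck in the middle.
WHITE, BLACK = (255, 255, 255), (0, 0, 0)
speck_page = [[WHITE] * 5 for _ in range(5)]
speck_page[2][2] = BLACK
cleaned = preprocess(speck_page)
```

On a scan with dark text on a light background, dilation followed by erosion removes isolated dark specks smaller than the window while leaving larger dark shapes in place, which is one plausible reason to apply it before OCR.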
Taxonomy
Topics: Handwritten Text Recognition Techniques · Biomedical Text Mining and Ontologies · Machine Learning in Healthcare
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Softmax · Residual Connection · WordPiece · Dense Connections · Linear Warmup With Linear Decay · Weight Decay
