Detecting automatically the layout of clinical documents to enhance the performances of downstream natural language processing
Christel Gérardin, Perceval Wajsbürt, Basile Dura, Alice Calliger, Alexandre Moucher, Xavier Tannier, Romain Bey

TL;DR
This paper presents a novel algorithm for analyzing the layout of clinical PDF documents to improve the extraction of relevant medical text, thereby enhancing downstream natural language processing tasks in healthcare.
Contribution
The study introduces a new deep learning-based layout analysis algorithm specifically designed for clinical PDFs, demonstrating improved text extraction and medical concept detection performance.
Findings
Achieved 98.4% precision in body text extraction
Improved medical concept extraction accuracy in clinical documents
Validated system enhances downstream NLP tasks in healthcare
Abstract
Objective: Develop and validate an algorithm for analyzing the layout of PDF clinical documents to improve the performance of downstream natural language processing tasks. Materials and Methods: We designed an algorithm to process clinical PDF documents and extract only clinically relevant text. The algorithm consists of several steps: initial text extraction using a PDF parser, followed by classification into categories such as body text, left notes, and footers using a Transformer deep neural network architecture, and finally an aggregation step to compile the lines of a given label in the text. We evaluated the technical performance of the body text extraction algorithm by applying it to a random sample of documents that were annotated. Medical performance was evaluated by examining the extraction of medical concepts of interest from the text in their respective sections. Finally, we…
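The final aggregation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each extracted line already carries a label from an upstream classifier, and the `Line` structure, its fields (`page`, `y`, `text`, `label`), the label strings, and the `aggregate_by_label` function are all hypothetical names introduced here for illustration. The sketch also assumes that a smaller `y` means the line sits higher on the page.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Line:
    """One extracted text line (hypothetical structure, not from the paper)."""
    page: int    # zero-based page index
    y: float     # vertical position; assumed smaller = higher on the page
    text: str
    label: str   # assumed classifier output, e.g. "body", "left_note", "footer"


def aggregate_by_label(lines):
    """Group classified lines by label in reading order and join each group."""
    grouped = defaultdict(list)
    # Sort by page, then by vertical position, to approximate reading order.
    for line in sorted(lines, key=lambda l: (l.page, l.y)):
        grouped[line.label].append(line.text)
    return {label: "\n".join(texts) for label, texts in grouped.items()}


# Toy input standing in for the output of the line classifier.
lines = [
    Line(0, 700.0, "Hospital X - page 1", "footer"),
    Line(0, 120.0, "The patient was admitted for chest pain.", "body"),
    Line(0, 100.0, "Reason for admission", "body"),
    Line(0, 110.0, "Dr. A. Example", "left_note"),
]

sections = aggregate_by_label(lines)
body_text = sections["body"]  # only the clinically relevant body text
```

Downstream NLP components would then read only `body_text`, so that footers and margin notes do not interfere with concept extraction.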
Taxonomy
Topics: Educational Technology Systems · Biomedical Text Mining and Ontologies · Artificial Intelligence in Healthcare
Methods: Attention Is All You Need · Absolute Position Encodings · Softmax · Layer Normalization · Byte Pair Encoding · Dropout · Linear Layer · Label Smoothing · Multi-Head Attention · Adam
