Automatic identification of diagnosis from hospital discharge letters via weakly-supervised Natural Language Processing
Vittorio Torri, Elisa Barbieri, Anna Cantarutti, Carlo Giaquinto, Francesca Ieva

TL;DR
This paper introduces a weakly-supervised NLP pipeline that accurately classifies diagnoses from Italian discharge letters, significantly reducing manual annotation effort while maintaining high performance and adaptability across diseases.
Contribution
The study presents a novel weakly-supervised approach using clustering and transformer models to classify discharge letters without manual labels, improving scalability and efficiency.
Findings
Achieved an AUC of 77.68% and F1-score of 78.14% in classifying bronchiolitis.
Surpassed other unsupervised methods and approached supervised model performance.
Saved over 1,500 hours of expert annotation time for large datasets.
Abstract
Identifying patient diagnoses from discharge letters is essential to enable large-scale cohort selection and epidemiological research, but traditional supervised approaches rely on extensive manual annotation, which is often impractical for large textual datasets. In this study, we present a novel weakly-supervised Natural Language Processing pipeline designed to classify Italian discharge letters without requiring manual labelling. After extracting diagnosis-related sentences, the method leverages a transformer-based model with an additional pre-training on Italian medical documents to generate semantic embeddings. A two-level clustering procedure is applied to these embeddings, and the resulting clusters are mapped to the diseases of interest to derive weak labels for a subset of data, eventually used to train a transformer-based classifier. We evaluate the approach on a real-world…
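The clustering-to-weak-label step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the 2-D vectors stand in for transformer sentence embeddings, the tiny k-means replaces whatever clustering algorithm the paper uses, and the keyword-share rule for mapping clusters to a disease (with the `hi` and `lo` thresholds) is a hypothetical stand-in, since the abstract does not specify how clusters are mapped. All function names and data below are assumptions made for illustration.

```python
import math

def kmeans(points, k, iters=20):
    """Tiny deterministic k-means: farthest-point init, then Lloyd iterations."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from its nearest existing center.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                centers[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return assign

def two_level_clusters(embeddings, k_coarse=2, k_fine=2):
    """Cluster all embeddings coarsely, then re-cluster inside each coarse cluster."""
    coarse = kmeans(embeddings, k_coarse)
    cluster_ids = [None] * len(embeddings)
    for c in set(coarse):
        idx = [i for i, a in enumerate(coarse) if a == c]
        fine = kmeans([embeddings[i] for i in idx], min(k_fine, len(idx)))
        for i, f in zip(idx, fine):
            cluster_ids[i] = (c, f)
    return cluster_ids

def weak_labels(sentences, cluster_ids, keyword, hi=0.8, lo=0.2):
    """Hypothetical mapping rule: label a cluster 1 or 0 only when the share of
    sentences mentioning the keyword is decisive; otherwise abstain (None)."""
    labels = {}
    for cid in set(cluster_ids):
        members = [s for s, c in zip(sentences, cluster_ids) if c == cid]
        share = sum(keyword in s.lower() for s in members) / len(members)
        labels[cid] = 1 if share >= hi else 0 if share <= lo else None
    return [labels[c] for c in cluster_ids]

# Toy diagnosis sentences with made-up 2-D "embeddings" (assumed data).
sentences = [
    "acute bronchiolitis", "bronchiolitis, RSV positive", "viral bronchiolitis",
    "asthma exacerbation", "wheezing, asthma", "asthma, previous bronchiolitis",
    "femur fracture", "wrist fracture",
    "gastroenteritis", "acute gastroenteritis",
]
embeddings = [
    (0.0, 0.0), (0.1, 0.1), (0.0, 0.2),
    (1.0, 0.0), (1.1, 0.1), (1.05, 0.05),
    (5.0, 5.0), (5.1, 5.0),
    (5.0, 6.0), (5.1, 6.1),
]

cluster_ids = two_level_clusters(embeddings)
weak = weak_labels(sentences, cluster_ids, keyword="bronchiolitis")
for sentence, label in zip(sentences, weak):
    print(f"{label!s:>4}  {sentence}")
```

On this toy data the three bronchiolitis sentences receive weak label 1, the fracture and gastroenteritis sentences receive 0, and the mixed asthma cluster is left unlabeled. Only the labeled subset would then be passed on to train a classifier, which mirrors the abstract's point that weak labels are derived for a subset of the data.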
Taxonomy
Topics: Handwritten Text Recognition Techniques · Natural Language Processing Techniques
