Developing a general-purpose clinical language inference model from a large corpus of clinical notes
Madhumita Sushil, Dana Ludwig, Atul J. Butte, Vivek A. Rudrapatna

TL;DR
This study trained a large clinical language model on a corpus of 75 million deidentified clinical notes. The model shows improved performance on clinical inference tasks, which highlights the benefits of a domain-specific vocabulary and extensive training data.
Contribution
It introduces a new clinical language model trained on 75 million notes, showing enhanced inference capabilities over existing models, especially in domain-specific tasks.
Findings
Model performs on par with top biomedical models on benchmark tasks.
Significantly outperforms comparable models on UCSF-specific tasks.
In-domain vocabulary improves encoding of longer documents.
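One way to see why an in-domain vocabulary can help with longer documents: a subword tokenizer splits words it does not know into several pieces, so a note full of clinical terms uses up a fixed-length input window faster. The sketch below assumes a WordPiece-style greedy longest-match tokenizer, as in the original BERT, and uses two tiny hypothetical vocabularies (`GENERAL_VOCAB` and `CLINICAL_VOCAB`) invented for illustration. They are not the vocabularies from the paper, and the example sentence is made up.

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece matches: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def tokenize(text, vocab):
    """Lowercase, split on whitespace, then apply WordPiece to each word."""
    return [p for w in text.lower().split() for p in wordpiece(w, vocab)]


# Hypothetical toy vocabularies (assumptions for illustration only).
GENERAL_VOCAB = {
    "patient", "with", "and", "hyper", "##tension",
    "ch", "##ole", "##cy", "##st", "##itis",
}
# The in-domain vocabulary additionally stores whole clinical terms.
CLINICAL_VOCAB = GENERAL_VOCAB | {"hypertension", "cholecystitis"}

note = "patient with hypertension and cholecystitis"
general_tokens = tokenize(note, GENERAL_VOCAB)
clinical_tokens = tokenize(note, CLINICAL_VOCAB)

print(len(general_tokens), general_tokens)
print(len(clinical_tokens), clinical_tokens)
```

With fewer tokens per word, more of a long note fits inside the same maximum sequence length, which is one plausible reading of the finding above. The actual effect size depends on the real vocabularies and notes.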
Abstract
Several biomedical language models have already been developed for clinical language inference. However, these models typically utilize general vocabularies and are trained on relatively small clinical corpora. We sought to evaluate the impact of using a domain-specific vocabulary and a large clinical training corpus on the performance of these language models in clinical language inference. We trained a Bidirectional Encoder Representations from Transformers (BERT) model using a diverse corpus of 75 million deidentified clinical notes authored at the University of California, San Francisco (UCSF). We evaluated this model on several clinical language inference benchmark tasks: clinical and temporal concept recognition, relation extraction and medical language inference. We also evaluated our model on two tasks using discharge summaries from UCSF: diagnostic code assignment and…
Taxonomy
Topics: Topic Modeling · Biomedical Text Mining and Ontologies · Natural Language Processing Techniques
