Deep learning with sentence embeddings pre-trained on biomedical corpora improves the performance of finding similar sentences in electronic medical records
Qingyu Chen, Jingcheng Du, Sun Kim, W. John Wilbur, Zhiyong Lu

TL;DR
This paper shows that deep learning models built on sentence embeddings pre-trained on biomedical text identify similar sentences in electronic medical records more reliably than earlier models.
Contribution
The paper applies sentence embeddings pre-trained on biomedical and clinical text (PubMed abstracts and MIMIC-III clinical notes) to strengthen deep learning models for semantic similarity in clinical texts.
Findings
An ensemble of models achieved a correlation of 0.8528.
Pre-trained biomedical embeddings improved model performance by ~13%.
Deep learning models outperformed traditional machine learning models that rely on manually crafted features.
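The correlation figures above compare predicted similarity scores against human-annotated ones. The following is a minimal sketch of that kind of evaluation, assuming the metric is the Pearson correlation and that an ensemble simply averages its members' predictions. The averaging rule, the 0-5 score scale, and every number below are invented for illustration; none of it is the authors' data or code.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def ensemble_average(predictions):
    """Average per-pair scores across models (one inner list per model)."""
    return [sum(scores) / len(scores) for scores in zip(*predictions)]


# Hypothetical gold similarity scores for five sentence pairs (assumed 0-5 scale).
gold = [4.5, 1.0, 3.0, 0.5, 2.5]
# Hypothetical predictions from two models; values are made up.
model_a = [4.0, 1.5, 2.0, 1.0, 3.0]
model_b = [4.8, 0.5, 3.5, 0.0, 2.0]

ensemble = ensemble_average([model_a, model_b])
for name, preds in [("model_a", model_a), ("model_b", model_b), ("ensemble", ensemble)]:
    print(f"{name}: r = {pearson(gold, preds):.4f}")
```

Plain averaging is only one way to combine models; weighted or learned combinations are equally plausible, and the summary does not say which was used.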
Abstract
Capturing sentence semantics plays a vital role in a range of text mining applications. Despite continuous efforts on the development of related datasets and models in the general domain, both datasets and models are limited in biomedical and clinical domains. The BioCreative/OHNLP organizers have made the first attempt to annotate 1,068 sentence pairs from clinical notes and have called for a community effort to tackle the Semantic Textual Similarity (BioCreative/OHNLP STS) challenge. We developed models using traditional machine learning and deep learning approaches. For the post challenge, we focus on two models: the Random Forest and the Encoder Network. We applied sentence embeddings pre-trained on PubMed abstracts and MIMIC-III clinical notes and updated the Random Forest and the Encoder Network accordingly. The official results demonstrated our best submission was the ensemble of…
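To make the idea of comparing sentences through embeddings concrete, here is a minimal sketch, assuming each sentence has already been mapped to a fixed-length vector by some pre-trained encoder. The toy vectors, the `pair_features` helper, and the choice of features are illustrative assumptions, not the authors' implementation of the Random Forest or the Encoder Network.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def pair_features(u, v):
    """Hypothetical feature vector for a sentence pair.

    Concatenates the cosine similarity, the element-wise absolute
    difference, and the element-wise product of the two embeddings.
    A downstream regressor could take this as input.
    """
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    abs_diff = [abs(a - b) for a, b in zip(u, v)]
    product = [a * b for a, b in zip(u, v)]
    return [cosine_similarity(u, v)] + abs_diff + product


# Toy 4-dimensional "embeddings"; real sentence embeddings are much larger.
sentence_1 = [0.2, 0.7, 0.1, 0.5]
sentence_2 = [0.3, 0.6, 0.0, 0.4]

features = pair_features(sentence_1, sentence_2)
print(f"cosine similarity: {features[0]:.4f}")
print(f"feature vector length: {len(features)}")
```

The cosine score alone can serve as an unsupervised similarity estimate, while the longer feature vector is the kind of input a supervised model could be trained on against annotated scores.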
Taxonomy
Topics: Topic Modeling · Biomedical Text Mining and Ontologies · Natural Language Processing Techniques
