Developing Healthcare Language Model Embedding Spaces
Niall Taylor, Dan Schofield, Andrey Kormilitzin, Dan W Joyce, Alejo Nevado-Holgado

TL;DR
This paper investigates methods to adapt small pre-trained language models for healthcare text, demonstrating that contrastive learning enhances classification performance and embedding quality, with implications for resource-efficient, domain-specific medical NLP applications.
Contribution
The study introduces a contrastive pre-training approach and evaluates metadata-based objectives for healthcare LLM adaptation, providing guidelines for efficient domain-specific model development.
Findings
Contrastively trained models outperform other methods on classification tasks.
Domain-adapted LLMs surpass general base LLMs in healthcare tasks.
Metadata pre-training improves embedding cluster separability.
Abstract
Pre-trained Large Language Models (LLMs) often struggle on out-of-domain datasets such as healthcare-focused text. We explore specialized pre-training to adapt smaller LLMs to different healthcare datasets. Three methods are assessed: traditional masked language modeling, Deep Contrastive Learning for Unsupervised Textual Representations (DeCLUTR), and a novel pre-training objective utilizing metadata categories from the healthcare settings. These schemes are evaluated on downstream document classification tasks for each dataset, with additional analysis of the resultant embedding spaces. Contrastively trained models outperform other approaches on the classification tasks, delivering strong performance from limited labeled data and with fewer model parameter updates required. While metadata-based pre-training does not further improve classifications across the datasets, it yields…
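To illustrate the kind of objective that contrastive pre-training relies on, here is a minimal sketch of a generic InfoNCE-style loss with in-batch negatives. It is not the paper's implementation: the function names (`cosine`, `info_nce_loss`), the temperature value, and the toy two-dimensional embeddings are all assumptions chosen for illustration. In a real setup the embeddings would come from a language model encoding text spans.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce_loss(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch (hypothetical helper).

    positives[i] is treated as the positive for anchors[i]; every other
    entry in positives acts as an in-batch negative for that anchor.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching positive as the target.
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)

# Toy embeddings (assumed values, for illustration only).
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # positives close to their anchors
swapped = [[0.1, 0.9], [0.9, 0.1]]   # positives close to the wrong anchor

loss_aligned = info_nce_loss(anchors, aligned)
loss_swapped = info_nce_loss(anchors, swapped)
```

The loss is small when each anchor is most similar to its own positive and large when it is more similar to another item in the batch, which is the pressure that pulls related text together in the embedding space.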
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Electronic Health Records Systems · Semantic Web and Ontologies
Methods: Contrastive Learning · ALIGN · Balanced Selection
