Benchmarking for Biomedical Natural Language Processing Tasks with a Domain Specific ALBERT
Usman Naseem, Adam G. Dunn, Matloob Khushi, Jinman Kim

TL;DR
This paper introduces BioALBERT, a domain-specific language model trained on biomedical and clinical corpora. It achieves state-of-the-art results on 17 of 20 benchmark datasets spanning 6 biomedical NLP tasks.
Contribution
BioALBERT is the first domain-specific adaptation of ALBERT trained on biomedical and clinical data, significantly improving performance on diverse NLP tasks and establishing new baseline benchmarks.
Findings
BioALBERT outperforms previous models on 17 of 20 datasets.
Achieves an 11.09% BLURB score improvement in named entity recognition over the previous state of the art.
Provides publicly available models and data for the biomedical NLP community.
Abstract
The availability of biomedical text data and advances in natural language processing (NLP) have made new applications in biomedical NLP possible. Language models trained or fine-tuned using domain-specific corpora can outperform general models, but work to date in biomedical NLP has been limited in terms of corpora and tasks. We present BioALBERT, a domain-specific adaptation of A Lite Bidirectional Encoder Representations from Transformers (ALBERT), trained on biomedical (PubMed and PubMed Central) and clinical (MIMIC-III) corpora and fine-tuned for 6 different tasks across 20 benchmark datasets. Experiments show that BioALBERT outperforms the state of the art on named entity recognition (+11.09% BLURB score improvement), relation extraction (+0.80% BLURB score), sentence similarity (+1.05% BLURB score), document classification (+0.62% F1-score), and question answering (+2.83% BLURB…
Taxonomy
Topics: Topic Modeling · Biomedical Text Mining and Ontologies · Natural Language Processing Techniques
