Benchmarking for Biomedical Natural Language Processing Tasks with a Domain Specific ALBERT

Usman Naseem; Adam G. Dunn; Matloob Khushi; Jinman Kim

arXiv:2107.04374·cs.CL·July 12, 2021

TL;DR

This paper introduces BioALBERT, a domain-specific language model trained on biomedical and clinical corpora, which achieves state-of-the-art results across multiple biomedical NLP tasks and datasets, setting new benchmarks for the field.

Contribution

BioALBERT is the first domain-specific adaptation of ALBERT trained on biomedical and clinical data, significantly improving performance on diverse NLP tasks and establishing new baseline benchmarks.

Findings

01

BioALBERT outperforms previous models on 17 of 20 datasets.

02

BioALBERT achieves up to an 11.09% improvement in named entity recognition.

03

The paper provides publicly available models and data for the biomedical NLP community.

Abstract

The availability of biomedical text data and advances in natural language processing (NLP) have made new applications in biomedical NLP possible. Language models trained or fine-tuned using domain-specific corpora can outperform general models, but work to date in biomedical NLP has been limited in terms of corpora and tasks. We present BioALBERT, a domain-specific adaptation of A Lite Bidirectional Encoder Representations from Transformers (ALBERT), trained on biomedical (PubMed and PubMed Central) and clinical (MIMIC-III) corpora and fine-tuned for 6 different tasks across 20 benchmark datasets. Experiments show that BioALBERT outperforms the state of the art on named entity recognition (+11.09% BLURB score improvement), relation extraction (+0.80% BLURB score), sentence similarity (+1.05% BLURB score), document classification (+0.62% F1-score), and question answering (+2.83% BLURB…

Code & Models

Repositories

usmaann/BioALBERT
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Biomedical Text Mining and Ontologies · Natural Language Processing Techniques