Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing
Yu Gu, Robert Tinn, Hao Cheng, Michael Lucas, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, and Hoifung Poon

TL;DR
Pretraining language models from scratch on biomedical text significantly outperforms continual pretraining of general models, establishing new state-of-the-art results across various biomedical NLP tasks.
Contribution
This paper demonstrates the benefits of domain-specific pretraining from scratch for biomedical NLP and introduces a comprehensive benchmark and models for the community.
Findings
Pretraining from scratch yields better performance than continual pretraining.
Domain-specific models achieve state-of-the-art results on biomedical NLP tasks.
Simpler modeling choices can be effective; for example, NER does not require complex tagging schemes.
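To make the NER finding concrete, here is a minimal sketch of what a simpler tagging scheme can look like. It assumes the simpler scheme is plain IO tagging (inside/outside only) as opposed to BIO; the summary above does not name the scheme, and the helper names `bio_to_io` and `io_spans` and the example sentence are hypothetical illustrations, not the authors' code.

```python
from typing import List, Tuple


def bio_to_io(tags: List[str]) -> List[str]:
    """Collapse BIO tags to IO: 'B-X' and 'I-X' both become 'I-X'; 'O' is unchanged."""
    io_tags = []
    for tag in tags:
        if tag == "O":
            io_tags.append("O")
        else:
            label = tag.partition("-")[2]  # text after the first hyphen
            io_tags.append(f"I-{label}")
    return io_tags


def io_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Decode IO tags into (start, end_exclusive, label) spans.

    Each maximal run of identical labels is treated as one entity.
    """
    spans = []
    start, label = 0, None
    # A trailing "O" sentinel flushes an entity that ends at the last token.
    for i, tag in enumerate(tags + ["O"]):
        current = None if tag == "O" else tag.partition("-")[2]
        if current != label:
            if label is not None:
                spans.append((start, i, label))
            start, label = i, current
    return spans


# Hypothetical example sentence and labels, for illustration only.
tokens = ["Mutations", "in", "BRCA1", "cause", "breast", "cancer"]
bio = ["O", "O", "B-Gene", "O", "B-Disease", "I-Disease"]
io = bio_to_io(bio)
entities = [(" ".join(tokens[s:e]), label) for s, e, label in io_spans(io)]
```

The trade-off in this sketch is that IO cannot separate two adjacent entities of the same type, since they decode as a single span. In exchange, the label set is smaller and decoding needs no consistency rules between B- and I- tags.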
Abstract
Pretraining large neural language models, such as BERT, has led to impressive gains on many natural language processing (NLP) tasks. However, most pretraining efforts focus on general domain corpora, such as newswire and Web. A prevailing assumption is that even domain-specific pretraining can benefit by starting from general-domain language models. In this paper, we challenge this assumption by showing that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains over continual pretraining of general-domain language models. To facilitate this investigation, we compile a comprehensive biomedical NLP benchmark from publicly-available datasets. Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks, leading to new state-of-the-art results across…
Code & Models
- 🤗 microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract (model) · 1.0M downloads · ♡ 90
- 🤗 mervenoyan/PubMedBERT-QNLI (model) · 9 downloads · ♡ 8
- 🤗 microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext (model) · 280k downloads · ♡ 315
- 🤗 Timofey/PubMedBERT_Cell_Components_Context_Classifier (model)
- 🤗 Timofey/PubMedBERT_Diseases_Side_Effects_Context_Classifier (model) · ♡ 1
- 🤗 Timofey/PubMedBERT_Pathways_Context_Classifier (model)
- 🤗 Timofey/PubMedBERT_Genes_Proteins_Context_Classifier (model) · ♡ 4
- 🤗 Timofey/PubMedBERT_Drugs_Metabolites_Context_Classifier (model) · ♡ 2
- 🤗 allenai/drug_combinations_lm_pubmedbert (model) · 5 downloads · ♡ 2
- 🤗 microsoft/BiomedNLP-BiomedBERT-large-uncased-abstract (model) · 5.0k downloads · ♡ 21
Videos
Domain-specific language model pretraining for biomedical natural language processing (YouTube)
Taxonomy
Methods: Linear Layer · Dense Connections · WordPiece · Residual Connection · Linear Warmup With Linear Decay · Layer Normalization · Attention Is All You Need · Adam · Dropout
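As an illustration of one listed method, linear warmup with linear decay can be sketched as a function from training step to learning rate. The function name and every numeric value below are hypothetical placeholders chosen for the example; they are not the hyperparameters reported in the paper.

```python
def linear_warmup_linear_decay(step: int, total_steps: int,
                               warmup_steps: int, peak_lr: float) -> float:
    """Learning rate at `step`: rises linearly from 0 to peak_lr over
    warmup_steps, then falls linearly back to 0 at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical settings, for illustration only.
schedule = [
    linear_warmup_linear_decay(s, total_steps=1000, warmup_steps=100, peak_lr=1e-4)
    for s in (0, 50, 100, 550, 1000)
]
```

With these placeholder settings the rate starts at zero, reaches its peak at step 100, is back at half the peak by step 550, and returns to zero at step 1000.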
