Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing

Yu Gu; Robert Tinn; Hao Cheng; Michael Lucas; Naoto Usuyama; Xiaodong Liu; Tristan Naumann; Jianfeng Gao; and Hoifung Poon

arXiv:2007.15779·cs.CL·September 20, 2021

TL;DR

Pretraining language models from scratch on biomedical text significantly outperforms continual pretraining of general models, establishing new state-of-the-art results across various biomedical NLP tasks.

Contribution

This paper demonstrates the benefits of domain-specific pretraining from scratch for biomedical NLP and introduces a comprehensive benchmark and models for the community.

Findings

01

Pretraining from scratch yields better performance than continual pretraining.

02

Domain-specific models achieve state-of-the-art results on biomedical NLP tasks.

03

Simpler approaches can be effective, such as avoiding complex tagging schemes in NER.
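The summary above does not say which tagging schemes are meant. As a minimal sketch, assuming "complex tagging schemes" refers to boundary-marking schemes such as BIO and that the simpler alternative is plain IO tagging, the difference can be illustrated by collapsing BIO labels into IO labels. The function name `bio_to_io`, the example sentence, and the entity types are hypothetical and are not taken from the paper's code.

```python
def bio_to_io(tags):
    """Collapse BIO tags to IO tags: B-X and I-X both become I-X; O stays O."""
    io_tags = []
    for tag in tags:
        if tag == "O":
            io_tags.append("O")
        else:
            # Drop the B-/I- prefix and keep only the entity type.
            _, entity_type = tag.split("-", 1)
            io_tags.append("I-" + entity_type)
    return io_tags


# Hypothetical example sentence with illustrative entity types.
tokens = ["Mutations", "in", "BRCA1", "cause", "breast", "cancer"]
bio = ["O", "O", "B-Gene", "O", "B-Disease", "I-Disease"]
io = bio_to_io(bio)

for token, bio_tag, io_tag in zip(tokens, bio, io):
    print(f"{token:10s} {bio_tag:10s} {io_tag}")
```

The IO scheme needs fewer labels per entity type, at the cost of not being able to separate two adjacent entities of the same type.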

Abstract

Pretraining large neural language models, such as BERT, has led to impressive gains on many natural language processing (NLP) tasks. However, most pretraining efforts focus on general domain corpora, such as newswire and Web. A prevailing assumption is that even domain-specific pretraining can benefit by starting from general-domain language models. In this paper, we challenge this assumption by showing that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains over continual pretraining of general-domain language models. To facilitate this investigation, we compile a comprehensive biomedical NLP benchmark from publicly-available datasets. Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks, leading to new state-of-the-art results across…

Code & Models

Videos

Domain-specific language model pretraining for biomedical natural language processing · YouTube

Taxonomy

Methods

Linear Layer · Dense Connections · WordPiece · Residual Connection · Linear Warmup With Linear Decay · Layer Normalization · Attention Is All You Need · Adam · Dropout