Biomedical Language Models are Robust to Sub-optimal Tokenization

Bernal Jiménez Gutiérrez; Huan Sun; Yu Su

arXiv:2306.17649 · cs.CL · July 11, 2023

TL;DR

This study investigates whether a tokenizer that segments biomedical terms more accurately improves language model performance on biomedical NLP tasks. It finds that biomedical pre-training is robust to sub-optimal tokenization.

Contribution

The paper shows that biomedical language models are largely robust to sub-optimal tokenization during pre-training, which challenges the assumption that more accurate segmentation is needed for strong downstream performance.

Findings

01. Standard biomedical tokenizers often fail to segment terms into meaningful parts.

02. Using a more accurate biomedical tokenizer does not significantly improve model performance.

03. Biomedical pre-training is robust to sub-optimal tokenization, as shown by various evaluation metrics.
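
To make the first finding concrete, the sketch below contrasts two segmentations of a single biomedical term. It is only an illustration under stated assumptions: the greedy longest-match-first routine is a generic WordPiece-style procedure, not necessarily the tokenizer studied in the paper, and both vocabularies (`STANDARD_VOCAB`, `MORPHEME_VOCAB`) as well as the function names are hypothetical toy values chosen for this example. The term "nephrectomy" combines the morphemes "nephr" (kidney) and "ectomy" (surgical removal).

```python
def wordpiece_segment(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation (WordPiece-style).

    Non-initial pieces carry the conventional "##" prefix.
    Returns [unk] if any part of the word cannot be matched.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]
        pieces.append(match)
        start = end
    return pieces


def split_points(pieces):
    """Character offsets at which a segmentation splits the word."""
    offsets = set()
    pos = 0
    for piece in pieces[:-1]:
        text = piece[2:] if piece.startswith("##") else piece
        pos += len(text)
        offsets.add(pos)
    return offsets


# Hypothetical toy vocabularies, NOT taken from the paper.
STANDARD_VOCAB = {"neph", "##rec", "##tomy"}   # frequency-style pieces
MORPHEME_VOCAB = {"nephr", "##ectomy"}         # morpheme-aligned pieces

WORD = "nephrectomy"
GOLD_SPLITS = {5}  # boundary between "nephr" and "ectomy"

standard = wordpiece_segment(WORD, STANDARD_VOCAB)
morpheme = wordpiece_segment(WORD, MORPHEME_VOCAB)

print(standard, split_points(standard) == GOLD_SPLITS)
print(morpheme, split_points(morpheme) == GOLD_SPLITS)
```

With these toy vocabularies, the frequency-style pieces cut the word at positions that do not coincide with the morpheme boundary, while the morpheme-aligned vocabulary recovers it. Per the second and third findings, this difference in segmentation quality did not translate into a significant difference in downstream performance.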

Abstract

As opposed to general English, many concepts in biomedical terminology have been designed in recent history by biomedical professionals with the goal of being precise and concise. This is often achieved by concatenating meaningful biomedical morphemes to create new semantic units. Nevertheless, most modern biomedical language models (LMs) are pre-trained using standard domain-specific tokenizers derived from large-scale biomedical corpus statistics, without explicitly leveraging the agglutinating nature of biomedical language. In this work, we first find that standard open-domain and biomedical tokenizers are largely unable to segment biomedical terms into meaningful components. Therefore, we hypothesize that using a tokenizer which segments biomedical terminology more accurately would enable biomedical LMs to improve their performance on downstream biomedical NLP tasks, especially ones…

Code & Models

Repositories

osu-nlp-group/bio-tokenization (PyTorch, official)

Taxonomy

Topics: Biomedical Text Mining and Ontologies · Topic Modeling · Natural Language Processing Techniques