Biomedical Language Models are Robust to Sub-optimal Tokenization
Bernal Jiménez Gutiérrez, Huan Sun, Yu Su

TL;DR
This study investigates whether using more accurate biomedical tokenizers improves language model performance on biomedical NLP tasks, finding that pre-training is robust even with sub-optimal tokenization.
Contribution
The paper demonstrates that biomedical language models are surprisingly unaffected by sub-optimal tokenization during pre-training, challenging the assumption that tokenizer accuracy is important for downstream performance.
Findings
Standard biomedical tokenizers often fail to segment terms into meaningful parts.
Using a more accurate biomedical tokenizer does not significantly improve model performance.
Biomedical pre-training is robust to sub-optimal tokenization, as shown by various evaluation metrics.
Abstract
As opposed to general English, many concepts in biomedical terminology have been designed in recent history by biomedical professionals with the goal of being precise and concise. This is often achieved by concatenating meaningful biomedical morphemes to create new semantic units. Nevertheless, most modern biomedical language models (LMs) are pre-trained using standard domain-specific tokenizers derived from large scale biomedical corpus statistics without explicitly leveraging the agglutinating nature of biomedical language. In this work, we first find that standard open-domain and biomedical tokenizers are largely unable to segment biomedical terms into meaningful components. Therefore, we hypothesize that using a tokenizer which segments biomedical terminology more accurately would enable biomedical LMs to improve their performance on downstream biomedical NLP tasks, especially ones…
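To make the segmentation problem concrete, here is a minimal sketch of greedy longest-match-first segmentation, the inference procedure used by WordPiece-style tokenizers. Both vocabularies are hypothetical toy sets invented for this illustration; they are not taken from the paper or from any released tokenizer, and the function name `greedy_segment` is likewise an assumption. The sketch only shows how the same term can split into opaque fragments under one vocabulary and into meaningful morphemes under another.

```python
def greedy_segment(word, vocab):
    """Segment `word` by repeatedly taking the longest prefix found in `vocab`.

    Non-initial pieces carry the "##" continuation prefix, as in WordPiece.
    Returns ["[UNK]"] if some position cannot be matched at all.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return ["[UNK]"]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical vocabulary of the kind corpus statistics might produce:
# frequent character sequences that ignore morpheme boundaries.
STATS_VOCAB = {"ne", "##ph", "##rec", "##tomy"}

# Hypothetical morpheme-aware vocabulary: "nephr" (kidney) + "ectomy" (removal).
MORPHEME_VOCAB = {"nephr", "##ectomy"}

if __name__ == "__main__":
    for name, vocab in [("stats", STATS_VOCAB), ("morpheme", MORPHEME_VOCAB)]:
        print(name, greedy_segment("nephrectomy", vocab))
```

Under the toy statistical vocabulary, "nephrectomy" breaks into four pieces that do not align with its morphemes, while the morpheme-aware vocabulary recovers the two meaningful units. The paper's finding is that, perhaps unexpectedly, closing this gap does not significantly improve downstream performance.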
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Topic Modeling · Natural Language Processing Techniques
