Infusing clinical knowledge into tokenisers for language models
Abul Hasan, Jinge Wu, Quang Ngoc Nguyen, Salomé Andres, Imane Guellil, Huayu Zhang, Arlene Casey, Beatrice Alex, Bruce Guthrie, Honghan Wu

TL;DR
This paper presents K-Tokeniser, a knowledge-enhanced tokenisation method for clinical texts that improves model performance and training efficiency without pretraining, by integrating domain knowledge into token representations.
Contribution
The study introduces K-Tokeniser, a novel semantic-based tokenisation approach that leverages domain ontologies and context to enhance clinical language model performance.
Findings
Consistent improvements across multiple clinical NLP tasks.
13% increase in Micro F1 score for clinical coding.
Models require significantly less data to achieve optimal performance.
Abstract
This study introduces a novel knowledge-enhanced tokenisation mechanism, K-Tokeniser, for clinical text processing. Technically, at the initialisation stage, K-Tokeniser populates global representations of tokens based on semantic types of domain concepts (such as drugs or diseases) from either a domain ontology such as the Unified Medical Language System (UMLS) or the training data of the task-related corpus. At the training or inference stage, sentence-level localised context is used to choose the optimal global token representation, realising the semantic-based tokenisation. To avoid pretraining with the new tokeniser, an embedding initialisation approach is proposed to generate representations for new tokens. Using three transformer-based language models, a comprehensive set of experiments is conducted on four real-world datasets to evaluate K-Tokeniser in a wide range of clinical text…
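The two stages described in the abstract can be illustrated with a small toy sketch. It is not the authors' implementation: the function names (`choose_semantic_token`, `init_new_embedding`), the `word#type` token format, the cosine-similarity selection rule, and the mean-of-subwords initialisation are all assumptions made for illustration, since the abstract does not specify these details.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def choose_semantic_token(word, context_vectors, global_reps):
    """Pick the semantic-type variant of `word` closest to the sentence context.

    `global_reps` maps a word to {semantic_type: vector}. This stands in for
    the global representations built at initialisation (hypothetical layout).
    """
    candidates = global_reps.get(word)
    if not candidates:
        return word  # no knowledge-enhanced variants: keep the plain token
    context = mean_vector(context_vectors)
    best_type = max(candidates, key=lambda t: cosine(candidates[t], context))
    return f"{word}#{best_type}"  # assumed naming scheme for the new token

def init_new_embedding(subword_pieces, base_embeddings):
    """Assumed initialisation: average the embeddings of the original subwords."""
    return mean_vector([base_embeddings[p] for p in subword_pieces])

# Toy data: two semantic types for "cold" and a context leaning towards "disease".
global_reps = {"cold": {"disease": [1.0, 0.0], "temperature": [0.0, 1.0]}}
context = [[0.9, 0.1], [0.8, 0.3]]
chosen = choose_semantic_token("cold", context, global_reps)

# Toy base-model embeddings for the subword pieces of the new token.
base = {"co": [1.0, 2.0], "##ld": [3.0, 4.0]}
new_vec = init_new_embedding(["co", "##ld"], base)

print(chosen, new_vec)
```

In this sketch the new token starts from an average of embeddings the base model already has, which is one common way to add vocabulary without pretraining; the paper's actual initialisation may differ.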
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Artificial Intelligence in Healthcare and Education
Methods: Sparse Evolutionary Training · Ontology
