Enhancing Clinical Concept Extraction with Contextual Embeddings
Yuqi Si, Jingqi Wang, Hua Xu, Kirk Roberts

TL;DR
This paper demonstrates that contextual embeddings like ELMo and BERT, especially when trained on clinical data, significantly improve clinical concept extraction performance over traditional methods, setting new state-of-the-art results.
Contribution
It systematically compares traditional and contextual embeddings for clinical concept extraction and introduces an intuitive method to interpret semantic information in contextual embeddings.
Findings
Contextual embeddings outperform traditional methods in clinical concept extraction.
Pre-training on clinical corpora enhances embedding effectiveness.
Achieved new state-of-the-art F1 scores on multiple datasets.

Table 1: Descriptive statistics of the datasets.

| Dataset | Subset | # Notes | # Entities |
|---|---|---|---|
| i2b2 2010 | Train | 349 | 27,837 |
| i2b2 2010 | Test | 477 | 45,009 |
| i2b2 2012 | Train | 190 | 16,468 |
| i2b2 2012 | Test | 120 | 13,594 |
| SemEval 2014 Task 7 | Train | 199 | 5,816 |
| SemEval 2014 Task 7 | Test | 99 | 5,351 |
| SemEval 2015 Task 14 | Train | 298 | 11,167 |
| SemEval 2015 Task 14 | Test | 133 | 7,998 |

Table 2: Off-the-shelf embedding models.

| Method | Size | |
|---|---|---|
| GloVe | 300 | NA |
| fastText | 300 | NA |
| ELMo | 512 | |
| BERTBASE | 768 | |
| BERTLARGE | 1024 | |

F1 by embedding method on each corpus. General = off-the-shelf embeddings; MIMIC = embeddings pre-trained on MIMIC-III.

| Method | i2b2 2010 (General) | i2b2 2010 (MIMIC) | i2b2 2012 (General) | i2b2 2012 (MIMIC) | SemEval 2014 Task 7 (General) | SemEval 2014 Task 7 (MIMIC) | SemEval 2015 Task 14 (General) | SemEval 2015 Task 14 (MIMIC) |
|---|---|---|---|---|---|---|---|---|
| word2vec | 80.38 | 84.32 | 71.07 | 75.09 | 72.2 | 77.48 | 73.09 | 76.42 |
| GloVe | 84.08 | 85.07 | 74.95 | 75.27 | 70.22 | 77.73 | 72.13 | 76.68 |
| fastText | 83.46 | 84.19 | 73.24 | 74.83 | 69.87 | 76.47 | 72.67 | 77.85 |
| ELMo | 83.83 | 87.8 | 76.61 | 80.5 | 72.27 | 78.58 | 75.15 | 80.46 |
| BERTBASE | 84.33 | 89.55 | 76.62 | 80.34 | 76.76 | 80.07 | 77.57 | 80.67 |
| BERTLARGE | 85.48 | 90.25 | 78.14 | 80.91 | 78.75 | 80.74 | 77.97 | 81.65 |
| BioBERT | 84.76 | - | 77.77 | - | 77.91 | - | 79.97 | - |

Prior SOTA: i2b2 2010, 88.60 Zhu et al. (2018); i2b2 2012, Liu et al. (2017); SemEval 2014 Task 7, 80.3 Tang et al. (2015); SemEval 2015 Task 14, 81.3 Zhang et al. (2014).

Performance by concept type on i2b2 2010.

| Concept type | word2vec | GloVe | fastText | ELMo | BERTBASE | BERTLARGE |
|---|---|---|---|---|---|---|
| Problem | 84.16 | 85.08 | 84.32 | 88.76 | 89.61 | 89.26 |
| Test | 85.93 | 84.96 | 84.01 | 87.39 | 88.09 | 88.8 |
| Treatment | 83.14 | 84.73 | 83.89 | 86.98 | 88.3 | 89.14 |

Performance by concept type on i2b2 2012.

| Concept type | word2vec | GloVe | fastText | ELMo | BERTBASE | BERTLARGE |
|---|---|---|---|---|---|---|
| Problem | 76.49 | 77.83 | 75.35 | 84.1 | 85.91 | 86.1 |
| Test | 78.12 | 81.26 | 76.94 | 84.76 | 86.88 | 86.56 |
| Treatment | 76.22 | 78.52 | 76.88 | 83.9 | 84.27 | 85.09 |
| Clinical dept | 78.18 | 77.92 | 77.27 | 83.71 | 77.92 | 78.23 |
| Evidential | 73.14 | 74.26 | 72.94 | 72.95 | 74.21 | 74.96 |
| Occurrence | 64.77 | 64.19 | 61.02 | 66.27 | 62.36 | 65.65 |
Enhancing Clinical Concept Extraction with Contextual Embeddings
Yuqi Si Jingqi Wang Hua Xu Kirk Roberts
School of Biomedical Informatics
The University of Texas Health Science Center at Houston
{yuqi.si,kirk.roberts}@uth.tmc.edu
Abstract
Neural network-based representations ("embeddings") have dramatically advanced natural language processing (NLP) tasks, including clinical NLP tasks such as concept extraction. Recently, however, more advanced embedding methods and representations (e.g., ELMo, BERT) have further pushed the state-of-the-art in NLP, yet there are no common best practices for how to integrate these representations into clinical tasks. The purpose of this study, then, is to explore the space of possible options in utilizing these new models for clinical concept extraction, including comparing these to traditional word embedding methods (word2vec, GloVe, fastText). Both off-the-shelf, open-domain embeddings and pre-trained clinical embeddings from MIMIC-III are evaluated. We explore a battery of embedding methods consisting of traditional word embeddings and contextual embeddings, and compare these on four concept extraction corpora: i2b2 2010, i2b2 2012, SemEval 2014, and SemEval 2015. We also analyze the impact of the pre-training time of a large language model like ELMo or BERT on the extraction performance. Last, we present an intuitive way to understand the semantic information encoded by contextual embeddings. Contextual embeddings pre-trained on a large clinical corpus achieve new state-of-the-art performance across all concept extraction tasks. The best-performing model outperforms all state-of-the-art methods with respective F1-measures of 90.25, 93.18 (partial), 80.74, and 81.65. We demonstrate the potential of contextual embeddings through the state-of-the-art performance these methods achieve on clinical concept extraction. Additionally, we demonstrate that contextual embeddings encode valuable semantic information not accounted for in traditional word representations.
1 Introduction
Concept extraction is the most common clinical natural language processing (NLP) task Tang et al. (2013); Kundeti et al. (2016); Unanue et al. (2017); Wang et al. (2018b), and a precursor to downstream tasks such as relations Rink et al. (2011), frame parsing Gupta et al. (2018); Si and Roberts (2018), co-reference Lee et al. (2011), and phenotyping Xu et al. (2011); Velupillai et al. (2018). Corpora such as those from i2b2 Uzuner et al. (2011); Sun et al. (2013); Stubbs et al. (2015), ShARe/CLEF Suominen et al. (2013); Kelly et al. (2014), and SemEval Pradhan et al. (2014); Elhadad et al. (2015); Bethard et al. (2016) act as evaluation benchmarks and datasets for training machine learning (ML) models.
Meanwhile, neural network-based representations continue to advance nearly all areas of NLP, from question answering Shen et al. (2017) to named entity recognition Chang et al. (2015); Wu et al. (2015); Habibi et al. (2017); Unanue et al. (2017); Florez et al. (2018) (a close analog of concept extraction). Recent advances in contextualized representations, including ELMo Peters et al. (2018) and BERT Devlin et al. (2018), have pushed performance even further. These have demonstrated that relatively simple downstream models using contextualized embeddings can outperform complex models Seo et al. (2016) using embeddings such as word2vec Mikolov et al. (2013) and GloVe Pennington et al. (2014).
In this paper, we aim to explore the potential impact these representations have on clinical concept extraction. Our contributions include the following:
1. An evaluation exploring numerous embedding methods: word2vec Mikolov et al. (2013), GloVe Pennington et al. (2014), fastText Bojanowski et al. (2016), ELMo Peters et al. (2018), and BERT Devlin et al. (2018).
2. An analysis covering four clinical concept corpora, demonstrating the generalizability of these methods.
3. A performance increase for clinical concept extraction that achieves state-of-the-art results on all four corpora.
4. A demonstration of the effect of pre-training on clinical corpora vs larger open domain corpora, an important trade-off in clinical NLP Roberts (2016).
5. A detailed analysis of the effect of pre-training time when starting from pre-built open domain models, which is important due to the long pre-training time of methods such as ELMo and BERT.
2 Background
This section introduces the theoretical knowledge that supports the shift from word-level embeddings to contextual embeddings.
2.1 Word Embedding Models
Word-level vector representation methods learn a real-valued vector to represent a single word. One of the most prominent methods for word-level representation is word2vec Mikolov et al. (2013), which has proven effective in achieving state-of-the-art performance in a variety of clinical NLP tasks Wang et al. (2018a). GloVe Pennington et al. (2014) is another unsupervised learning approach to obtain a vector representation for a single word. Unlike word2vec, GloVe is a statistical model that combines global matrix factorization with a local context window. The learning relies on dimensionality reduction of the co-occurrence count matrix, which records how frequently a word appears in a context. fastText Bojanowski et al. (2016) is also an established library for word representations. Unlike word2vec and GloVe, fastText treats each word as a bag of character n-grams. For instance, cold is made of the n-grams c, co, col, cold, o, ol, old, l, ld, and d. This approach enables handling of infrequent words that are not present in the training vocabulary, alleviating some out-of-vocabulary issues.
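As a minimal illustration of the subword decomposition described above, the following Python sketch enumerates every character n-gram of a word. The function name `char_ngrams` is hypothetical. The fastText library itself restricts n to a configurable range and adds word-boundary markers, both of which are omitted here so that the output mirrors the cold example.

```python
def char_ngrams(word, min_n=1, max_n=None):
    """Return all character n-grams of `word` with min_n <= n <= max_n."""
    if max_n is None:
        max_n = len(word)
    grams = []
    for start in range(len(word)):
        for n in range(min_n, max_n + 1):
            end = start + n
            if end > len(word):
                break  # longer n-grams would run past the end of the word
            grams.append(word[start:end])
    return grams


print(char_ngrams("cold"))
```

Because an unseen word still shares n-grams with known words, a vector can be composed for it from the n-gram vectors, which is how the out-of-vocabulary issue is eased.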
However, the effectiveness of word-level representations is limited because they conflate all possible meanings of a word into a single representation, so the embedding is not adjusted to the surrounding context. To address this deficiency, more advanced approaches model the word's context directly in the vector representation. Figure 1 illustrates this with the word cold: a traditional word embedding assigns a single vector to all senses of the word, whereas a contextual representation varies the vector based on its meaning in context (e.g., cold temperature, medical symptom/condition, an unfriendly disposition). Although a fictional figure is shown here, we later demonstrate this on real data.
The first contextual word representation that we consider to overcome this issue is ELMo Peters et al. (2018). Traditional word embeddings assign each word a single vector that stays fixed in downstream tasks. In contrast, this contextual word representation captures context information and dynamically adjusts a multilayer representation. At training time, a language model objective is used to learn the context-sensitive embeddings from a large text corpus. This step of learning context-sensitive embeddings is known as pre-training. After pre-training, the context-sensitive embedding of each word in a sentence is fed into the downstream task. The downstream model learns weights over the internal states of the pre-trained language model by optimizing the loss on the downstream task.
BERT Devlin et al. (2018) is also a contextual word representation model and, similar to ELMo, is pre-trained on an unlabeled corpus with a language model objective. Compared to ELMo, BERT is deeper in how it handles contextual information due to a deep bidirectional transformer for encoding sentences. It is based on a transformer architecture employing self-attention Vaswani et al. (2017). The deep bidirectional transformer is equipped with multi-headed self-attention to prevent locality bias and to achieve long-distance context comprehension. The two models also differ in how they are incorporated into the downstream task: ELMo is a feature-based language representation, while BERT is a fine-tuning approach. The feature-based strategy is similar to traditional word embedding methods, which treat the embedding as input features for the downstream task. The fine-tuning approach, on the other hand, adjusts the entire language model on the downstream task to achieve a task-specific architecture. So while the ELMo embeddings may be used as the input of a downstream model, with the BERT fine-tuning method, the entire BERT model is integrated into the downstream task. This fine-tuning strategy is more likely to make use of the encoded information in the pre-trained language models.
2.2 Clinical Concept Extraction
Clinical concept extraction is the task of identifying medical concepts (e.g., problem, test, treatment) from clinical notes. This is typically considered as a sequence tagging problem to be solved with machine learning-based models (e.g., Conditional Random Field) using hand-engineered clinical domain knowledge as features Wang et al. (2018b); De Bruijn et al. (2011). Recent advances have demonstrated the effectiveness of deep learning-based models with word embeddings as input. Up to now, the most prominent model for clinical concept extraction is a bidirectional Long Short-Term Memory with Conditional Random Field (Bi-LSTM CRF) architecture Habibi et al. (2017); Florez et al. (2018); Chalapathy et al. (2016). The bidirectional LSTM-based recurrent neural network captures both forward and backward information in the sentence and the CRF layer considers sequential output correlations in the decoding layer using the Viterbi algorithm.
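The decoding step mentioned above can be sketched as follows. This is a minimal, dependency-free Viterbi decoder and not the authors' implementation. The names `viterbi_decode`, `emissions`, and `transitions` are illustrative, the scores are made up, and the start and end transitions that a full CRF layer usually learns are left out.

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag path and its score.

    emissions:   one list of per-tag scores for each token (e.g. Bi-LSTM outputs).
    transitions: transitions[i][j] is the score of moving from tag i to tag j.
    """
    num_tags = len(transitions)
    score = list(emissions[0])
    backpointers = []
    for emission in emissions[1:]:
        new_score, pointers = [], []
        for j in range(num_tags):
            # Best previous tag for landing on tag j at this position.
            best_i = max(range(num_tags), key=lambda i: score[i] + transitions[i][j])
            pointers.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j] + emission[j])
        score = new_score
        backpointers.append(pointers)
    last = max(range(num_tags), key=lambda i: score[i])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]


TAGS = ["O", "B", "I"]
# Hypothetical scores: moving from O straight to I is heavily penalized.
transitions = [[0.0, 0.0, -10.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
emissions = [[2.0, 0.0, 0.0], [0.0, 1.0, 1.5], [0.0, 0.0, 2.0]]
path, best = viterbi_decode(emissions, transitions)
print([TAGS[t] for t in path], best)
```

Picking the best tag per token independently would give O, I, I here. Scoring the whole sequence yields O, B, I instead, which is the kind of output correlation the CRF layer is meant to capture.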
Most similar to this paper, several recent works have applied contextual embedding methods to concept extraction, both for clinical text and biomedical literature. For instance, ELMo has shown excellent performance on clinical concept extraction Zhu et al. (2018). BioBERT Lee et al. (2019) applied BERT primarily to literature concept extraction, pre-training on MEDLINE abstracts and PubMed Central articles, but also applied this model to the i2b2 2010 corpus without clinical pre-training (we include BioBERT in our experiments below). A recent preprint by Alsentzer et al. (2019) pre-trains on MIMIC-III, similar to our work, but achieves lower performance on the two tasks in common, i2b2 2010 and 2012. Their work does suggest potential value in only pre-training on MIMIC-III discharge summaries, as opposed to all notes, as well as combining clinical pre-training with literature pre-training. Finally, another recent preprint proposes the use of BERT not for concept extraction, but for clinical prediction tasks such as 30-day readmission prediction Huang et al. (2019).
3 Methods
In this paper, we consider both off-the-shelf embeddings from the open domain as well as pretraining clinical domain embeddings on clinical notes from MIMIC-III Johnson et al. (2016), which is a public database of Intensive Care Unit (ICU) patients.
For the traditional word-embedding experiments, the static embeddings are fed into a Bi-LSTM CRF architecture. All words that occur at least five times in the corpus are included, and infrequent words are denoted as UNK. To compensate for the information lost to those unknown words, character embeddings for each word are included.
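The frequency cutoff described above can be sketched in a few lines. The helper names `build_vocab` and `replace_rare` and the toy corpus are hypothetical, and details such as tokenization are assumptions rather than the authors' preprocessing.

```python
from collections import Counter

UNK = "UNK"


def build_vocab(tokenized_sentences, min_count=5):
    """Keep every token that occurs at least `min_count` times in the corpus."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    return {tok for tok, count in counts.items() if count >= min_count}


def replace_rare(sentence, vocab):
    """Map out-of-vocabulary tokens to the UNK symbol."""
    return [tok if tok in vocab else UNK for tok in sentence]


# Toy corpus: "reports" and "rigors" appear once, everything else five or more times.
corpus = [["patient", "denies", "fever"]] * 5 + [["patient", "reports", "rigors"]]
vocab = build_vocab(corpus)
print(replace_rare(["patient", "reports", "fever"], vocab))
```

Every rare word collapses to the same UNK vector, which is why character-level input is added to recover some of the lost signal.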
For ELMo, the context-independent embeddings with trainable weights are used to form context-dependent embeddings, which are then fed into the downstream task. Specifically, the context-dependent embedding is obtained through a low-dimensional projection and a highway connection after a stacked layer of a character-based Convolutional Neural Network (char-CNN) and a two-layer Bi-LSTM language model (bi-LM). Thus, the contextual word embedding is formed with a trainable aggregation of highly-connected bi-LM. Because the context-independent embeddings already consider representation of characters, it is not necessary to learn a character embedding input for the Bi-LSTM in concept extraction. Finally, the contextual word embedding for each word is fed into the prior state-of-the-art sequence labeling architecture, Bi-LSTM CRF, to predict the label for each token.
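One common form of such a trainable aggregation, described by Peters et al. (2018), is a softmax-weighted sum of the bi-LM layer outputs scaled by a single scalar. The sketch below shows that computation for one token. Whether the experiments here use exactly this form is an assumption, and the name `scalar_mix` and the toy vectors are illustrative only.

```python
import math


def scalar_mix(layer_vectors, layer_logits, gamma=1.0):
    """Combine per-layer vectors for one token into a single embedding.

    layer_vectors: list of equal-length vectors, one per bi-LM layer.
    layer_logits:  one trainable scalar per layer, softmax-normalized here.
    gamma:         trainable scalar that rescales the mixed vector.
    """
    peak = max(layer_logits)  # subtract the max for numerical stability
    exps = [math.exp(w - peak) for w in layer_logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(layer_vectors[0])
    return [
        gamma * sum(w * vec[d] for w, vec in zip(weights, layer_vectors))
        for d in range(dim)
    ]


# Three toy layers; equal logits reduce the mix to a plain average.
layers = [[3.0, 0.0], [0.0, 3.0], [0.0, 0.0]]
print(scalar_mix(layers, [0.0, 0.0, 0.0]))
```

In training, the logits and `gamma` would be updated with the downstream loss, so the task decides how much each layer contributes.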
For BERT, both the BERTBASE and BERTLARGE off-the-shelf models are used with additional Bi-LSTM layers at the top of the BERT architecture, which we refer to as BERTBASE(General) and BERTLARGE(General), respectively. For background, the BERT authors released two off-the-shelf cased models: BERTBASE and BERTLARGE, with 110 million and 340 million total parameters, respectively. BERTBASE has 12 layers of transformer blocks, 768 hidden units, and 12 self-attention heads, while BERTLARGE has 24 layers of transformer blocks, 1024 hidden units, and 16 self-attention heads. So BERTLARGE is both "wider" and "deeper" in model structure, but is otherwise essentially the same architecture. The models initialized from BERTBASE(General) and BERTLARGE(General) are fine-tuned on the downstream task (clinical concept recognition in our case). Because BERT integrates sufficient label-correlation information, the CRF layer is dropped and only a Bi-LSTM architecture is used for sequence labeling. Additionally, two clinical domain embedding models are pre-trained on MIMIC-III, initialized from the BERTBASE and BERTLARGE checkpoints, which we refer to as BERTBASE(MIMIC) and BERTLARGE(MIMIC), respectively.
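To make the "wider and deeper" comparison concrete, the following back-of-the-envelope sketch estimates encoder parameter counts from the layer and hidden sizes given above. The 512 positions, 2 segment types, and 4x feed-forward width are assumptions taken from the standard BERT configuration, not from this text. The vocabulary size of 28,996 is the cased vocabulary reported later in the paper, and pre-training heads are not counted. The function name `bert_param_count` is hypothetical.

```python
def bert_param_count(num_layers, hidden, vocab_size=28996,
                     max_positions=512, type_vocab=2, ffn_mult=4):
    """Approximate parameter count: embeddings + transformer layers + pooler."""
    layer_norm = 2 * hidden  # scale and bias
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden)  # query, key, value, output projections
    ffn = hidden * ffn_mult * hidden + ffn_mult * hidden  # up-projection
    ffn += ffn_mult * hidden * hidden + hidden            # down-projection
    per_layer = attention + ffn + 2 * layer_norm
    pooler = hidden * hidden + hidden
    return embeddings + num_layers * per_layer + pooler


for name, layers, hidden in [("BERTBASE", 12, 768), ("BERTLARGE", 24, 1024)]:
    print(name, round(bert_param_count(layers, hidden) / 1e6, 1), "million")
```

Under these assumptions the estimates come out near 108 million and 334 million, close to the rounded 110 million and 340 million figures. The number of attention heads does not change the count, since the heads split the same projection matrices.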
4 Datasets and Experiments
4.1 Datasets
Our experiments are performed on four widely-studied shared tasks: the 2010 i2b2/VA challenge Uzuner et al. (2011), the 2012 i2b2 challenge Sun et al. (2013), the SemEval 2014 Task 7 Pradhan et al. (2014), and the SemEval 2015 Task 14 Elhadad et al. (2015). The descriptive statistics for the datasets are shown in Table 1. The 2010 i2b2/VA challenge data contains a total of 349 training and 477 testing reports with three clinical concept types: Problem, Test, and Treatment. The 2012 i2b2 challenge data contains 190 training and 120 testing discharge summaries, with six clinical concept types: Problem, Test, Treatment, Clinical department, Evidential, and Occurrence. The SemEval 2014 Task 7 data contains 199 training and 99 testing reports with the concept type Disease disorder. The SemEval 2015 Task 14 data consists of 298 training and 133 testing reports with the concept type Disease disorder. For the two SemEval tasks, the disjoint concepts are handled with the "BIOHD" tagging schema Tang et al. (2015).
The clinical embeddings are trained on MIMIC-III Johnson et al. (2016), which consists of almost 2 million clinical notes. Notes that have an ERROR tag are first removed, leaving 1,908,359 notes with 786,414,528 tokens and a vocabulary of size 712,286. For pre-training traditional word embeddings, words are lowercased, as is standard practice. For pre-training ELMo and BERT, casing is preserved.
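Since concept extraction on these corpora is cast as token tagging, a small sketch may help show how tag sequences map to concept spans and how an exact-match F1 could be computed from them. It assumes a plain BIO scheme with contiguous spans. The BIOHD schema used for the SemEval tasks additionally handles disjoint concepts and is not covered here. The function names and the example tags are hypothetical, and the official shared-task scorers may differ in detail.

```python
def bio_to_spans(tags):
    """Convert BIO tags to a set of (start, end_exclusive, concept_type) spans."""
    spans, start, kind = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        inside = tag.startswith("I-") and kind == tag[2:]
        if start is not None and not inside:
            spans.add((start, i, kind))
            start, kind = None, None
        if tag.startswith("B-"):
            start, kind = i, tag[2:]
    return spans


def exact_f1(gold_tags, pred_tags):
    """F1 where a predicted span counts only if boundaries and type both match."""
    gold, pred = bio_to_spans(gold_tags), bio_to_spans(pred_tags)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = ["O", "B-problem", "I-problem", "O", "B-test"]
pred = ["O", "B-problem", "I-problem", "O", "O"]
print(bio_to_spans(gold))
print(exact_f1(gold, pred))
```

In this toy case the prediction finds one of the two gold concepts with no false positives, so precision is 1, recall is 0.5, and F1 is 2/3.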
4.2 Experimental Setting
4.2.1 Concept Extraction
Concept extraction is based on the Bi-LSTM CRF architecture proposed by Lample et al. (2016). For traditional embedding methods and ELMo embeddings, we use the same hyperparameter settings: hidden unit dimension of 512, dropout probability of 0.5, learning rate of 0.001, learning rate decay of 0.9, and Adam as the optimization algorithm. To prevent overfitting, training stops early after 5 epochs without improvement.
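The settings above can be collected into a small configuration, together with a sketch of the early-stopping rule. Two points are assumptions not stated in the text: that the decay factor is applied once per epoch, and that "improvement" is measured on a held-out validation score. The names `HYPERPARAMS`, `learning_rate_at`, and `early_stopping` are illustrative.

```python
HYPERPARAMS = {
    "hidden_units": 512,
    "dropout": 0.5,
    "learning_rate": 0.001,
    "lr_decay": 0.9,
    "optimizer": "adam",
    "patience": 5,
}


def learning_rate_at(epoch, params=HYPERPARAMS):
    # Assumption: the decay factor is applied once per epoch.
    return params["learning_rate"] * params["lr_decay"] ** epoch


def early_stopping(scores, patience=HYPERPARAMS["patience"]):
    """Return (best_epoch, best_score, last_epoch_run) for per-epoch scores."""
    best_score, best_epoch, stale, last_epoch = float("-inf"), -1, 0, -1
    for last_epoch, score in enumerate(scores):
        if score > best_score:
            best_score, best_epoch, stale = score, last_epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break  # no improvement for `patience` consecutive epochs
    return best_epoch, best_score, last_epoch


# Made-up validation scores: the late 80.0 is never reached.
print(early_stopping([70.0, 75.0, 74.0, 74.5, 74.9, 74.8, 74.0, 80.0]))
```

With these made-up scores, training halts after five stale epochs and the model from epoch 1 would be kept.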
4.2.2 Pre-training of Clinical Embeddings
Across embedding methods, two different scenarios of pre-training are investigated and compared:
1. Off-the-shelf embeddings from the official release, referred to as the General model.
2. Pre-trained embeddings on MIMIC-III, referred to as the MIMIC model.
In the first scenario, more details related to the embedding models are shown in Table 2. We also apply BioBERT Lee et al. (2019), which is the most recent pre-trained model on biomedical literature initialized from BERTBASE.
In the second scenario, for all the traditional embedding methods, we pre-train 300-dimensional embeddings on MIMIC-III clinical notes. We apply the following hyperparameter settings for all three traditional embedding methods (word2vec, GloVe, and fastText): window size of 15, minimum word count of 5, 15 iterations, and embedding size of 300 to match the off-the-shelf embeddings.
For ELMo, the hyperparameter setting for pre-training follows the default in Peters et al. (2018). Specifically, a char-CNN embedding layer is applied with 16-dimension character embeddings and filter widths of [1, 2, 3, 4, 5, 6, 7] with [32, 32, 64, 128, 256, 512, 1024] filters, respectively. After that, a two-layer Bi-LSTM with 4,096 hidden units in each layer is added. The output of the final bi-LM is projected to 512 dimensions with a highway connection. MIMIC-III was split into a training corpus (80%) for pre-training and a held-out testing corpus (20%) for evaluating perplexity. The pre-training step is performed on the training corpus for 15 epochs. The average perplexity on the testing corpus is 9.929.
For BERT, two clinical-domain models initialized from BERTBASE and BERTLARGE are pre-trained. Unless specified, we follow the authors' detailed instructions to set up the pre-training parameters: we tested other options, which led to problems such as poor model convergence, and concluded that the authors' recipe is a useful one when pre-training from their released models. The vocabulary list of 28,996 WordPiece tokens used in BERTBASE and BERTLARGE is adopted. According to their paper, the performance on the downstream tasks decreases as the training steps increase, so we save intermediate checkpoints (every 20,000 steps) and report the performance of the intermediate models on the downstream task.
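Two of the quantities in the ELMo setup can be checked with a few lines of arithmetic. The sketch assumes the standard definition of perplexity as the exponential of the average per-token negative log-likelihood (the text does not spell out the formula), and it assumes the char-CNN filter outputs are concatenated before projection, as in the reference ELMo design. The function name `perplexity` and the toy inputs are illustrative.

```python
import math


def perplexity(token_log_probs):
    """exp of the average negative log-likelihood per token (natural log)."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# Sanity check: a model that is uniform over 4 choices has perplexity 4.
uniform = [math.log(0.25)] * 8
print(round(perplexity(uniform), 3))

# Average negative log-likelihood (in nats) implied by a perplexity of 9.929.
print(round(math.log(9.929), 2))

# Total char-CNN feature maps before the projection to 512 dimensions.
CHAR_CNN_FILTERS = [32, 32, 64, 128, 256, 512, 1024]
print(sum(CHAR_CNN_FILTERS))
```

Read this way, a held-out perplexity of 9.929 corresponds to roughly 2.3 nats per token, and the seven filter banks yield 2,048 features that the projection reduces to 512.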
References
- Abadi et al. (2016) Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. Tensorflow: a system for large-scale machine learning. In OSDI, volume 16, pages 265–283.
- Alsentzer et al. (2019) Emily Alsentzer, John R Murphy, Willie Boag, Wei-Hung Weng, Di Jin, Tristan Naumann, and Matthew McDermott. 2019. Publicly available clinical bert embeddings. arXiv preprint arXiv:1904.03323.
- Bethard et al. (2016) Steven Bethard, Guergana Savova, Wei-Te Chen, Leon Derczynski, James Pustejovsky, and Marc Verhagen. 2016. Semeval-2016 task 12: Clinical tempeval. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 1052–1062.
- Bojanowski et al. (2016) Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.
- Browne et al. (2000) Allen C Browne, Alexa T McCray, and Suresh Srinivasan. 2000. The specialist lexicon. National Library of Medicine Technical Reports, pages 18–21.
- Chalapathy et al. (2016) Raghavendra Chalapathy, Ehsan Zare Borzeshi, and Massimo Piccardi. 2016. Bidirectional lstm-crf for clinical concept extraction. arXiv preprint arXiv:1611.08373.
- Chang et al. (2015) FX Chang, J Guo, WR Xu, and S Relly Chung. 2015. Application of word embeddings in biomedical named entity recognition tasks. Journal of Digital Information Management, 13(5).
- De Bruijn et al. (2011) Berry De Bruijn, Colin Cherry, Svetlana Kiritchenko, Joel Martin, and Xiaodan Zhu. 2011. Machine-learned solutions for three stages of clinical information extraction: the state of the art at i2b2 2010. Journal of the American Medical Informatics Association, 18(5):557–562.
