Context encoders as a simple but powerful extension of word2vec

Franziska Horn

arXiv:1706.02496·stat.ML·June 9, 2017

TL;DR

This paper introduces context encoders (ConEc), an extension of word2vec that enhances word representations by incorporating local context, enabling better handling of polysemy and out-of-vocabulary words, demonstrated through improved NER performance.

Contribution

The paper proposes context encoders (ConEc), a simple yet effective extension of word2vec that generates context-dependent embeddings for out-of-vocabulary words and words with multiple meanings.

Findings

1. ConEc improves embeddings for OOV words.
2. Enhanced embeddings lead to better NER performance.
3. The approach is computationally efficient.

Tables

Table 1: Accuracy on the analogy task with mean and standard deviation computed using three random seeds when initializing the word2vec model. The best results for each category and corpus are in bold.

Category	text8 (10 iter): word2vec	text8 (10 iter): Context Encoder	1-billion: word2vec	1-billion: Context Encoder
capital-common-countries	63.8 $\pm$ 4.7	78.7 $\pm$ 0.2	79.3 $\pm$ 2.2	83.1 $\pm$ 1.2
capital-world	34.0 $\pm$ 2.1	54.7 $\pm$ 1.3	63.8 $\pm$ 1.4	75.9 $\pm$ 0.4
currency	15.4 $\pm$ 0.9	19.3 $\pm$ 0.6	13.3 $\pm$ 3.6	14.8 $\pm$ 0.8
city-in-state	28.6 $\pm$ 1.0	43.6 $\pm$ 0.9	19.6 $\pm$ 1.7	29.6 $\pm$ 1.0
family	79.6 $\pm$ 1.5	77.2 $\pm$ 0.4	78.7 $\pm$ 2.2	79.0 $\pm$ 1.4
gram1-adjective-to-adverb	11.0 $\pm$ 0.9	16.6 $\pm$ 0.7	12.3 $\pm$ 0.5	13.3 $\pm$ 1.1
gram2-opposite	24.3 $\pm$ 3.0	24.3 $\pm$ 2.0	27.6 $\pm$ 0.1	21.3 $\pm$ 1.1
gram3-comparative	64.3 $\pm$ 0.5	63.0 $\pm$ 1.1	83.7 $\pm$ 0.9	76.2 $\pm$ 1.1
gram4-superlative	40.3 $\pm$ 2.1	37.6 $\pm$ 1.5	69.4 $\pm$ 0.5	56.2 $\pm$ 1.2
gram5-present-participle	30.5 $\pm$ 1.0	31.7 $\pm$ 0.4	78.4 $\pm$ 1.0	68.0 $\pm$ 0.7
gram6-nationality-adjective	70.6 $\pm$ 1.5	67.2 $\pm$ 1.4	83.8 $\pm$ 0.6	83.8 $\pm$ 0.5
gram7-past-tense	30.5 $\pm$ 1.8	33.0 $\pm$ 0.6	53.9 $\pm$ 0.9	49.2 $\pm$ 0.7
gram8-plural	49.8 $\pm$ 0.3	49.2 $\pm$ 1.2	62.7 $\pm$ 1.9	56.7 $\pm$ 1.0
gram9-plural-verbs	41.0 $\pm$ 2.5	30.1 $\pm$ 1.9	68.7 $\pm$ 0.2	45.0 $\pm$ 0.4
total	42.1 $\pm$ 0.6	46.5 $\pm$ 0.1	57.2 $\pm$ 0.3	55.8 $\pm$ 0.3

Equations

$$\mathbf{x}_{w_{\text{global}}} = \frac{1}{M_{w}} \sum_{i=1}^{M_{w}} \mathbf{x}_{w_{i}}$$

$$\mathbf{y}_{w} = \left(a \cdot \mathbf{x}_{w_{\text{global}}} + (1-a)\, \mathbf{x}_{w_{\text{local}}}\right)^{\top} W_{0}$$

Full text

Machine Learning Group

Technische Universität Berlin, Germany

Abstract

With a simple architecture and the ability to learn meaningful word embeddings efficiently from texts containing billions of words, word2vec remains one of the most popular neural language models used today. However, as only a single embedding is learned for every word in the vocabulary, the model fails to optimally represent words with multiple meanings. Additionally, it is not possible to create embeddings for new (out-of-vocabulary) words on the spot. Based on an intuitive interpretation of the continuous bag-of-words (CBOW) word2vec model’s negative sampling training objective in terms of predicting context based similarities, we motivate an extension of the model we call context encoders (ConEc). By multiplying the matrix of trained word2vec embeddings with a word’s average context vector, out-of-vocabulary (OOV) embeddings and representations for a word with multiple meanings can be created based on the word’s local contexts. The benefits of this approach are illustrated by using these word embeddings as features in the CoNLL 2003 named entity recognition (NER) task.

1 Introduction

Representation learning is very prominent in the field of natural language processing (NLP). For example, word embeddings learned by neural language models (NLM) were shown to improve the performance when used as features for supervised learning tasks such as named entity recognition (NER) (Collobert et al., 2011; Turian et al., 2010). The popular word2vec model (Mikolov et al., 2013a, b) learns meaningful word embeddings by considering only the words’ local contexts. Thanks to its shallow architecture it can be trained very efficiently on large corpora. The model, however, only learns a single representation for words from a fixed vocabulary. Consequently, if in a task we encounter a new word that was not present in the texts used for training, we cannot create an embedding for this word without repeating the time-consuming training procedure of the model. (Footnote 1: In practice the model is trained on such a large vocabulary that it is rare to encounter a word that does not have an embedding. Yet there are still scenarios where this is the case, for example, it is unlikely that the term “W10281545” is encountered in a regular training corpus, but we might still want its embedding to represent a search query like “whirlpool W10281545 ice maker part”.) Furthermore, a single embedding does not optimally represent a word with multiple meanings. For example, “Washington” is both the name of a US state as well as a former president and only by taking into account the word’s local context can one identify the proper sense.

Based on an intuitive interpretation of the continuous bag-of-words (CBOW) word2vec model’s negative sampling training objective, we propose an extension of the model we call context encoders (ConEc). This allows for an easy creation of OOV embeddings as well as a better representation of words with multiple meanings by simply multiplying the trained word2vec embeddings with the words’ average context vectors. As demonstrated by the CoNLL 2003 NER challenge, the classification performance can be significantly improved when using as features the word embeddings created with ConEc instead of word2vec.

Related work

In the past, NLM have addressed the issue of polysemy in various ways. For example, sense2vec is an extension of word2vec, where in a preprocessing step all words in the training corpus are annotated with their part-of-speech (POS) tag and then the embeddings are learned for tokens consisting of the words themselves and their POS tags. This way, different representations are generated, e.g. for words that are used both as a noun and as a verb (Trask et al., 2015). Other methods first cluster the contexts in which the words appear (Huang et al., 2012) or use additional resources such as WordNet to identify multiple meanings of words (Rothe and Schütze, 2015). One possibility to create OOV embeddings is to learn representations for all character n-grams in the texts and then compute the embedding of a word by combining the embeddings of the n-grams occurring in it (Bojanowski et al., 2016). However, none of these NLM are designed to solve both the OOV and the polysemy problem at the same time. Furthermore, compared to word2vec they require more parameters, resources, or additional steps in the training procedure. ConEc, on the other hand, can generate OOV embeddings as well as improved representations for words with multiple meanings by simply multiplying the matrix of trained word2vec embeddings with the words’ average context vectors.

2 Background: CBOW word2vec trained with negative sampling

Word2vec (Fig. 3 in the Appendix) learns $d$-dimensional vector representations, referred to as word embeddings, for all $N$ words in the vocabulary. It is a shallow NLM with parameter matrices $W_{0},W_{1}\in\mathbb{R}^{N\times d}$, which are tuned iteratively by scanning huge amounts of text sentence by sentence. Based on some context words, the algorithm tries to predict the target word between them. Mathematically, this is realized by first computing the sum of the embeddings of the context words by selecting the appropriate rows from $W_{0}$. This vector is then multiplied by several rows selected from $W_{1}$: one of these rows corresponds to the target word, while the others correspond to $k$ ‘noise’ words selected at random (negative sampling). After applying a non-linear activation function, the backpropagation error is computed by comparing this output to a label vector $\mathbf{t}\in\mathbb{R}^{k+1}$, which is 1 at the position of the target word and 0 for all $k$ noise words. After the training of the model is complete, the word embedding for a target word is the corresponding row of $W_{0}$.
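The training step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference word2vec implementation: it assumes a sigmoid as the non-linear activation and a logistic (cross-entropy) loss, and the function name, the learning rate `lr`, and the way noise words are passed in are hypothetical choices made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cbow_negative_sampling_step(W0, W1, context_ids, target_id, noise_ids, lr=0.05):
    """Apply one SGD update in place and return the loss before the update."""
    out_ids = [target_id] + list(noise_ids)   # target word first, then the k noise words

    h = W0[list(context_ids)].sum(axis=0)     # sum of the context rows of W0, shape (d,)
    t = np.zeros(len(out_ids))                # label vector t in R^{k+1}
    t[0] = 1.0                                # 1 for the target word, 0 for the noise words
    p = sigmoid(W1[out_ids] @ h)              # outputs after the non-linearity, shape (k+1,)

    loss = -np.sum(t * np.log(p) + (1.0 - t) * np.log(1.0 - p))
    err = p - t                               # gradient of the loss w.r.t. the pre-activations
    grad_h = err @ W1[out_ids]                # computed before W1 is modified

    # Plain loops keep the updates correct even if an index occurs more than once.
    for j, e in zip(out_ids, err):
        W1[j] -= lr * e * h
    for c in context_ids:
        W0[c] -= lr * grad_h
    return loss
```

Calling the function repeatedly on the same (context, target, noise) triple should lower the returned loss, which is a quick sanity check that the update direction is right.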

3 Context Encoders

Similar words appear in similar contexts (Harris, 1954). For example, two words synonymous with each other could be exchanged for one another in almost all contexts without a reader noticing. Based on the context word co-occurrences, pairwise similarities between all $N$ words of the vocabulary can be computed, resulting in a similarity matrix $S\in\mathbb{R}^{N\times N}$ (or for a single word $w$ the vector $\mathbf{s}_{w}\in\mathbb{R}^{N}$) with similarity scores between $0$ and $1$. These similarities should be preserved in the word embeddings, e.g. the cosine similarity between the embedding vectors of two words used in similar contexts should be close to $1$, or, more generally, the product $YY^{\top}$ of the embedding matrix $Y\in\mathbb{R}^{N\times d}$ with its transpose should approximate $S$. The most straightforward way of obtaining word embeddings satisfying $YY^{\top}\approx S$ would be to compute the singular value decomposition (SVD) of the similarity matrix $S$ and use the eigenvectors corresponding to the $d$ largest eigenvalues (Levy et al., 2014, 2015). As our vocabulary typically comprises tens of thousands of words, performing an SVD of the corresponding similarity matrix is computationally far too expensive. Yet, while the similarity matrix would be huge, it would also be quite sparse, as many words are of course not synonymous with each other. If we picked a small number $k$ of random words, chances are their similarities to a target word would be close to $0$. Therefore, while the product of a single word’s embedding $\mathbf{y}_{w}\in\mathbb{R}^{d}$ and the matrix of all embeddings $Y$ should result in a vector $\mathbf{\hat{s}}_{w}\in\mathbb{R}^{N}$ close to the true similarities $\mathbf{s}_{w}$ of this word, if we only consider a small subset of $\mathbf{\hat{s}}_{w}$ corresponding to the word itself and $k$ random words, it is sufficient if this approximates the binary vector $\mathbf{t}_{w}\in\mathbb{R}^{k+1}$, which is $1$ for the word itself and $0$ elsewhere.
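For a vocabulary small enough to fit in memory, the decomposition-based baseline mentioned above can be sketched directly. The example assumes $S$ is symmetric, in which case the eigendecomposition can be used in place of a general SVD; scaling the eigenvectors by the square roots of the eigenvalues and clipping negative eigenvalues to zero are choices made for this illustration, and the function name is hypothetical.

```python
import numpy as np

def embed_from_similarities(S, d):
    """Return Y in R^{N x d} such that Y @ Y.T is a rank-d approximation of the symmetric matrix S."""
    eigvals, eigvecs = np.linalg.eigh(S)        # eigh returns eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:d]         # indices of the d largest eigenvalues
    lam = np.clip(eigvals[top], 0.0, None)      # Y @ Y.T cannot represent negative eigenvalues
    return eigvecs[:, top] * np.sqrt(lam)       # scale each eigenvector column by sqrt(eigenvalue)
```

The cost of the decomposition grows roughly cubically with $N$, which is why this route becomes impractical for vocabularies of tens of thousands of words.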

The CBOW word2vec model trained with negative sampling can therefore be interpreted as a neural network (NN) that predicts a word’s similarities to other words (Fig. 1). During training, for each occurrence $i$ of a word $w$ in the texts, a binary vector $\mathbf{x}_{w_{i}}\in\mathbb{R}^{N}$, which is $1$ at the positions of the context words of $w$ and $0$ elsewhere, is used as input to the network and multiplied by a set of weights $W_{0}$ to arrive at an embedding $\mathbf{y}_{w_{i}}\in\mathbb{R}^{d}$ (the summed rows of $W_{0}$ corresponding to the context words). This embedding is then multiplied by another set of weights $W_{1}$, which corresponds to the full matrix of word embeddings $Y$, to produce the output of the network, a vector $\mathbf{\hat{s}}_{w_{i}}\in\mathbb{R}^{N}$ containing the approximated similarities of the word $w$ to all other words. The training error is then computed by comparing a subset of the output to a binary target vector $\mathbf{t}_{w_{i}}\in\mathbb{R}^{k+1}$, which serves as an approximation of the true similarities $\mathbf{s}_{w}$ when considering only a small number of random words. We refer to this interpretation of the model as context encoders (ConEc), as it is closely related to similarity encoders (SimEc), a dimensionality reduction method used for learning similarity preserving representations of data points (Horn and Müller, 2017).

While the training procedure of ConEc is identical to that of word2vec, there is a difference in the computation of a word’s embedding after the training is complete. In the case of word2vec, the word embedding is simply the row of the tuned $W_{0}$ matrix. When considering the idea behind the optimization procedure, we instead propose to create the representation of a target word $w$ by multiplying $W_{0}$ with the word’s average context vector $\mathbf{x}_{w}$ , as this better resembles how the word embeddings are computed during training.

We distinguish between a word’s ‘global’ and ‘local’ average context vector (CV): The global CV is computed as the average of all binary CVs $\mathbf{x}_{w_{i}}$ corresponding to the $M_{w}$ occurrences of $w$ in the whole training corpus:

$$\mathbf{x}_{w_{\text{global}}} = \frac{1}{M_{w}} \sum_{i=1}^{M_{w}} \mathbf{x}_{w_{i}},$$

while the local CV $\mathbf{x}_{w_{\text{local}}}$ is computed likewise but considering only the $m_{w}$ occurrences of $w$ in a single document. We can now compute the embedding of a word $w$ by multiplying $W_{0}$ with the weighted average between both CVs:

$$\mathbf{y}_{w} = \left(a \cdot \mathbf{x}_{w_{\text{global}}} + (1-a)\, \mathbf{x}_{w_{\text{local}}}\right)^{\top} W_{0} \qquad (1)$$

with $a\in[0,1]$. The choice of $a$ determines how much emphasis is placed on the word’s local context, which helps to distinguish between multiple meanings of the word (Melamud et al., 2015). (Footnote 2: This implicitly assumes a word is only used in a single sense in one document.) As an out-of-vocabulary word does not have a global CV (as it never occurred in the training corpus), its embedding is computed solely based on the local context, i.e. setting $a=0$.
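The computation of average context vectors and of the resulting ConEc embedding can be sketched as follows. This is an illustrative reading of the description above, not the code from the author's repository: the function names, the `vocab` dictionary mapping words to row indices of $W_{0}$, the whitespace tokenization, and the handling of context words outside the vocabulary (they are skipped) are all assumptions.

```python
import numpy as np

def average_context_vector(tokens, word, vocab, window=5):
    """Average binary context vector of `word` over its occurrences in `tokens`.

    Returns None if `word` does not occur in `tokens`.
    """
    cvs = []
    for i, tok in enumerate(tokens):
        if tok != word:
            continue
        x = np.zeros(len(vocab))
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i and tokens[j] in vocab:
                x[vocab[tokens[j]]] = 1.0       # binary: 1 at the positions of the context words
        cvs.append(x)
    return np.mean(cvs, axis=0) if cvs else None

def conec_embedding(W0, x_global, x_local, a=0.6):
    """y_w = (a * x_global + (1 - a) * x_local)^T W0, falling back to a single CV if one is missing."""
    if x_global is None and x_local is None:
        raise ValueError("word has neither a global nor a local context vector")
    if x_global is None:                        # OOV word: only local context is available (a = 0)
        return x_local @ W0
    if x_local is None:                         # word absent from the document (a = 1)
        return x_global @ W0
    return (a * x_global + (1.0 - a) * x_local) @ W0
```

For an OOV token such as the part number in the search query from the introduction, `x_global` is `None` and the embedding reduces to the sum of the $W_{0}$ rows of its in-vocabulary neighbours, averaged over its occurrences in the document.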

With this new perspective on the model and optimization procedure, another advancement is feasible. Since the context words are merely a sparse feature vector used as input to a NN, there is no reason why this input vector should not contain other features about the target word as well. For example, the feature vector $\mathbf{x}_{w}$ could be extended to contain information about the word’s case, part-of-speech (POS) tag, or other relevant details. While this would increase the dimensionality of the first weight matrix $W_{0}$ to include the additional features when mapping the input to the word’s embedding, the training objective and therefore also $W_{1}$ would remain unchanged. These additional features could be especially helpful if details about the words would otherwise get lost in preprocessing (e.g. by lowercasing) or to retain information about a word’s position in the sentence, which is ignored in a BOW approach. These extended ConEcs are expected to create embeddings that even better distinguish between the words’ different senses by taking into account, for example, if the word is used as a noun or verb in the current context, similar to the sense2vec algorithm (Trask et al., 2015). But instead of explicitly learning multiple embeddings per term, like sense2vec, only the dimensionality of the input vector is increased to include the POS tag of the current word as a feature, which is expected to improve generalization if few training examples are available.
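As a rough sketch of how such an extended input could look, the binary context vector can simply be concatenated with additional indicator features, with $W_{0}$ gaining one row per added feature. The tag set, the feature layout, and the function name below are hypothetical and only serve to illustrate the idea; the paper does not prescribe a particular encoding.

```python
import numpy as np

POS_TAGS = ["NOUN", "VERB", "ADJ", "OTHER"]      # hypothetical, deliberately tiny tag set

def extended_input(x_context, pos_tag, is_capitalized):
    """Concatenate the context vector with a one-hot POS tag and a capitalization flag."""
    pos = np.zeros(len(POS_TAGS))
    tag = pos_tag if pos_tag in POS_TAGS else "OTHER"
    pos[POS_TAGS.index(tag)] = 1.0
    case = np.array([1.0 if is_capitalized else 0.0])
    return np.concatenate([x_context, pos, case])

# W0 then has N + len(POS_TAGS) + 1 rows instead of N; the embedding is still x^T W0.
N, d = 3, 2
W0_ext = np.ones((N + len(POS_TAGS) + 1, d))
y = extended_input(np.array([1.0, 0.0, 1.0]), "VERB", True) @ W0_ext
```

Because only the input side grows, the output weights $W_{1}$ and the training targets keep their original shape.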

4 Experiments

The word embeddings learned by word2vec and context encoders are evaluated on the CoNLL 2003 NER benchmark task (Tjong et al., 2003). We use a CBOW word2vec model trained with negative sampling as described above where $k=13$, the embedding dimensionality $d$ is $200$ and we use a context window of $5$ words. The word embeddings created by ConEc are built directly on top of the word2vec model by multiplying the original embeddings ($W_{0}$) with the respective context vectors. Code to replicate the experiments is available online. (Footnote 3: https://github.com/cod3licious/conec) Additionally, the performance on a word analogy task (Mikolov et al., 2013a) is reported in the Appendix.

Named Entity Recognition

The main advantage of context encoders is their ability to use local context to create OOV embeddings and distinguish between the different senses of words. The effects of this are most prominent in a task such as NER, where the local context of a word can make all the difference, e.g. to distinguish between the “Chicago Bears” (an organization) and the city of Chicago (a location). We tested this on the CoNLL 2003 NER task by using the word embeddings as features together with a logistic regression classifier. The reported F1-scores were computed using the official evaluation script. The results achieved with various word embeddings in the training, development, and test part of the CoNLL task are reported in Fig. 2. It should be noted that we are using this task as an extrinsic evaluation to illustrate the advantages of ConEc embeddings over the regular word2vec embeddings. To isolate the effects on the performance, we are only using these word embeddings as features, while typically the performance on this NER challenge is much higher when other features such as a word’s case or POS tag are included as well.

The word2vec embeddings were trained on the documents used in the training part of the task. OOV words in the development and test parts are represented as zero vectors. (Footnote 4: Since this is a very small corpus, we trained word2vec for 25 iterations on these documents.) With three parameter settings, we illustrate the advantages of ConEc:

A) Multiplying the word2vec embeddings by the words’ average context vectors generally improves the embeddings. To show this, ConEc word embeddings were computed using only global CVs (Eq. 1 with $a=1$), which means OOV words again have a zero representation. With these embeddings (labeled ‘global’ in Fig. 2), the performance improves on the dev and test folds of the task.

B) Useful OOV embeddings can be created from the local context of a new word. To show this, the ConEc embeddings for words from the training vocabulary ($w\in N$) were computed as in A), but now the embeddings for OOV words ($w^{\prime}\notin N$) were computed using local CVs (Eq. 1 with $a=1\;\forall\,w\in N$ and $a=0\;\forall\,w^{\prime}\notin N$; referred to as ‘OOV’ in the figure). The training performance obviously stays the same, because here all words have an embedding based on their global contexts. However, there is a jump in the ConEc performance on the dev and test folds, where OOV words now have a representation based on their local contexts.

C) Better embeddings for a word with multiple meanings can be created by using a combination of the word’s average global and local CVs as input to the ConEc. To show this, the OOV embeddings were computed as in B), but now for the words occurring in the training vocabulary, the local context was taken into account as well by setting $a<1$ (Eq. 1 with $a\in[0,1)\;\forall\,w\in N$ and $a=0\;\forall\,w^{\prime}\notin N$). The best performances on all folds are achieved when averaging the global and local CVs with around $a=0.6$ before multiplying them with the word2vec embeddings. This clearly shows that ConEc embeddings created by incorporating local context can help distinguish between multiple meanings of words.
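The three settings differ only in how $a$ is chosen for in-vocabulary and OOV words, which the following sketch makes explicit. It is an illustration under assumptions: the setting names `"global"` and `"oov"` follow the figure labels quoted above, `"local"` is a hypothetical name for setting C, and a word is treated as OOV when its global context vector is `None`.

```python
import numpy as np

def ner_word_embedding(W0, x_global, x_local, setting, a=0.6):
    """Embedding used as NER feature under settings A ('global'), B ('oov') and C ('local')."""
    in_vocab = x_global is not None
    if setting == "global":                      # A: a = 1, OOV words stay zero vectors
        return x_global @ W0 if in_vocab else np.zeros(W0.shape[1])
    if setting == "oov":                         # B: a = 1 in vocabulary, a = 0 for OOV words
        return x_global @ W0 if in_vocab else x_local @ W0
    if setting == "local":                       # C: a < 1 in vocabulary, a = 0 for OOV words
        if not in_vocab:
            return x_local @ W0
        return (a * x_global + (1.0 - a) * x_local) @ W0
    raise ValueError(f"unknown setting: {setting}")
```

With `a=0.6`, the value reported above as working best, an in-vocabulary word's embedding is a 60/40 blend of its corpus-wide and document-specific contexts.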

5 Conclusion

Context encoders are a simple but powerful extension of the CBOW word2vec model trained with negative sampling. By multiplying the matrix of trained word2vec embeddings with the words’ average context vectors, ConEcs are easily able to create OOV embeddings on the spot as well as distinguish between multiple meanings of words based on their local contexts. The benefits of this were demonstrated in the CoNLL NER challenge.

Acknowledgments

I would like to thank Antje Relitz, Ivana Balažević, Christoph Hartmann, Andreas Nowag, Klaus-Robert Müller, and other anonymous reviewers for their helpful comments on earlier versions of this manuscript.

Franziska Horn acknowledges funding from the Elsa-Neumann scholarship from the TU Berlin.

Appendix

Analogy task

To show that the word embeddings created with context encoders capture meaningful semantic and syntactic relationships between words, we evaluated them on the original analogy task published together with the word2vec model (Mikolov et al., 2013a). (Footnote 5: https://code.google.com/archive/p/word2vec/) This task consists of many questions in the form of “man is to king as woman is to XXX” where the model is supposed to find the correct answer queen. This is accomplished by taking the word embedding for king, subtracting from it the embedding for man and then adding the embedding for woman. This new word vector should then be most similar (with respect to the cosine similarity) to the embedding for queen. (Footnote 6: Readers familiar with Levy et al. (2015) will recognize this as the 3CosAdd method. We have tried 3CosMul as well, but found that the results did not improve significantly and therefore omitted them here.) The word2vec model was trained for ten iterations on the text8 corpus (Footnote 7: http://mattmahoney.net/dc/text8.zip), which contains around 17 million words and a vocabulary of about 70k unique words, as well as the training part of the 1-billion benchmark dataset (Footnote 8: http://code.google.com/p/1-billion-word-language-modeling-benchmark/), which contains over 768 million words with a vocabulary of 486k unique words. (Footnote 9: In this experiment we ignore all words that occur less than 5 times in the training corpus.) The ConEc embeddings were then constructed by multiplying the word2vec embeddings with the words’ average global context vectors obtained from the same corpus as the word2vec model was trained on. To achieve the best results, we also had to include the target word itself in these context vectors.
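The analogy lookup described above (3CosAdd) can be sketched as follows. The sketch assumes that embeddings are length-normalized before the vector arithmetic and that the three question words are excluded from the candidate answers, as is common practice in analogy evaluations; the function name and the toy vectors in the usage lines are made up for illustration.

```python
import numpy as np

def analogy_3cosadd(E, vocab, a, b, c):
    """Answer 'a is to b as c is to ?' by the word whose embedding is closest to b - a + c."""
    index_to_word = {i: w for w, i in vocab.items()}
    En = E / np.linalg.norm(E, axis=1, keepdims=True)   # unit-length rows
    query = En[vocab[b]] - En[vocab[a]] + En[vocab[c]]
    scores = En @ (query / np.linalg.norm(query))       # cosine similarity to every word
    for w in (a, b, c):
        scores[vocab[w]] = -np.inf                      # never return one of the question words
    return index_to_word[int(np.argmax(scores))]

# Toy usage with hand-made 3-dimensional embeddings.
vocab = {"man": 0, "woman": 1, "king": 2, "queen": 3, "apple": 4}
E = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [-1.0, -1.0, 0.2]])
answer = analogy_3cosadd(E, vocab, "man", "king", "woman")
```

The same function works unchanged for word2vec and ConEc embeddings, since both are just rows of an $N\times d$ matrix.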

The results of the analogy task are shown in Table 1. To capture some of the semantic relations between words (e.g. the first four task categories) it can be advantageous to use context encoders instead of word2vec. One reason for the ConEcs’ superior performance on some of the task categories, but not others, might be that the city and country names compared in the first four task categories only have a single sense (referring to the respective location), while the words asked for in other task categories can have multiple meanings. For example, “run” can be used as both a noun and a verb. Additionally, in some contexts it refers to the sport activity, while at other times it is used in a more abstract sense, e.g. in the context of someone running for president. Therefore, the results in the other task categories might improve if the words’ context vectors are first clustered and then the ConEc embedding is generated by multiplying the word2vec embeddings with the average of only those context vectors corresponding to the word’s sense most appropriate for the task category.

Bibliography

1. Bojanowski et al. (2016): Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.
2. Collobert et al. (2011): Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. The Journal of Machine Learning Research 12:2493–2537.
3. Goldberg and Levy (2014): Yoav Goldberg and Omer Levy. 2014. word2vec explained: Deriving Mikolov et al.’s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722.
4. Harris (1954): Zellig S Harris. 1954. Distributional structure. Word 10(2-3):146–162.
5. Horn and Müller (2017): Franziska Horn and Klaus-Robert Müller. 2017. Learning similarity preserving representations with neural similarity encoders. arXiv preprint arXiv:1702.01824.
6. Huang et al. (2012): Eric H Huang, Richard Socher, Christopher D Manning, and Andrew Y Ng. 2012. Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1. ACL, pages 873–882.
7. Levy et al. (2015): Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics 3:211–225.
8. Levy et al. (2014): Omer Levy, Yoav Goldberg, and Israel Ramat-Gan. 2014. Linguistic regularities in sparse and explicit word representations. In CoNLL. pages 171–180.