A Mixture Model for Learning Multi-Sense Word Embeddings
Dai Quoc Nguyen, Dat Quoc Nguyen, Ashutosh Modi, Stefan Thater, and Manfred Pinkal

TL;DR
This paper introduces a mixture model for learning multi-sense word embeddings that accounts for different senses and their varying importance, leading to improved performance on standard evaluation tasks.
Contribution
It presents a generalized mixture model that induces multiple senses of words with different weights, advancing previous multi-sense embedding methods.
Findings
Our model outperforms previous models on standard evaluation tasks.
It effectively captures multiple senses with varying importance.
The approach improves the quality of word embeddings in semantic tasks.
Abstract
Word embeddings are now a standard technique for inducing meaning representations for words. To obtain good representations, it is important to take into account the different senses of a word. In this paper, we propose a mixture model for learning multi-sense word embeddings. Our model generalizes previous work in that it induces different weights for the different senses of a word. Experimental results show that our model outperforms previous models on standard evaluation tasks.
| Model | rw | SimLex | scws | ws353 | men |
|---|---|---|---|---|---|
| Huang et al. (2012) | – | – | 58.6 | 71.3 | – |
| Luong et al. (2013) | 34.36 | – | 48.48 | 64.58 | – |
| Qiu et al. (2014) | 32.13 | – | 53.40 | 65.19 | – |
| Neelakantan et al. (2014) | – | – | 65.5 | 69.2 | – |
| Chen et al. (2014) | – | – | 64.2 | – | – |
| Hill et al. (2015) | – | 41.4 | – | 65.5 | 69.9 |
| Vilnis and McCallum (2015) | – | 32.23 | – | 65.49 | 71.31 |
| Schnabel et al. (2015) | – | – | – | 64.0 | 70.7 |
| Rastogi et al. (2015) | 32.9 | 36.7 | 65.6 | 70.8 | 73.9 |
| Flekova and Gurevych (2016) | – | – | – | – | 74.26 |
| Word2Vec Skip-gram | 32.64 | 38.20 | 66.37 | 71.61 | 75.49 |
|  | 34.85 | 38.77 | 66.83 | 72.40 | 76.23 |
|  | 35.27 | 38.70 | 66.80 | 72.05 | 76.05 |
|  | 34.98 | 38.79 | 66.61 | 71.71 | 75.90 |
|  | 35.56⋆ | 39.19⋆ | 66.65 | 72.29 | 76.37⋆ |
| Model | AvgSim | AvgSimC |
|---|---|---|
| Huang et al. (2012) | 62.8 | 65.7 |
| Neelakantan et al. (2014) | 67.3 | 69.3 |
| Chen et al. (2014) | 66.2 | 68.9 |
| Chen et al. (2015) | 65.7 | 66.4 |
| Wu and Giles (2015) | – | 66.4 |
| Jauhar et al. (2015) | – | 65.7 |
| Cheng and Kartsaklis (2015) | 62.5 | – |
| Iacobacci et al. (2015) | 62.4 | – |
| Cheng et al. (2015) | – | 65.9 |
|  | 66.6 | 66.7 |
|  | 66.7 | 66.6 |
|  | 66.4 | 66.6 |
|  | 66.6 | 66.6 |
Dai Quoc Nguyen¹, Dat Quoc Nguyen², Ashutosh Modi¹, Stefan Thater¹, Manfred Pinkal¹
¹ Department of Computational Linguistics, Saarland University, Germany
{daiquocn, ashutosh, stth, pinkal}@coli.uni-saarland.de
² Department of Computing, Macquarie University, Australia
1 Introduction
Word embeddings have been shown to be useful in various NLP tasks such as sentiment analysis, topic models, script learning, machine translation, sequence labeling and parsing (Socher et al., 2013; Sutskever et al., 2014; Modi and Titov, 2014; Nguyen et al., 2015a, b; Modi, 2016; Ma and Hovy, 2016; Nguyen et al., 2017; Modi et al., 2017). A word embedding captures the syntactic and semantic properties of a word by representing the word in the form of a real-valued vector (Mikolov et al., 2013a, b; Pennington et al., 2014; Levy and Goldberg, 2014).
However, word embedding models usually do not take lexical ambiguity into account. For example, the word bank is usually represented by a single vector for all of its senses, including sloping land and financial institution. Recently, approaches have been proposed to learn multi-sense word embeddings, where each sense of a word corresponds to a sense-specific embedding. Reisinger and Mooney (2010), Huang et al. (2012) and Wu and Giles (2015) proposed methods that cluster the contexts of each word and then use the cluster centroids as vector representations for word senses. Neelakantan et al. (2014), Tian et al. (2014), Li and Jurafsky (2015) and Chen et al. (2015) extended Word2Vec models (Mikolov et al., 2013a, b) to learn a vector representation for each sense of a word. Chen et al. (2014), Iacobacci et al. (2015) and Flekova and Gurevych (2016) performed word sense induction using external resources (e.g., WordNet, BabelNet) and then learned sense embeddings using the Word2Vec models. Rothe and Schütze (2015) and Pilehvar and Collier (2016) presented methods using pre-trained word embeddings to learn embeddings from WordNet synsets. Cheng et al. (2015), Liu et al. (2015b), Liu et al. (2015a) and Zhang and Zhong (2016) directly adopted the Word2Vec Skip-gram model (Mikolov et al., 2013b) for learning the embeddings of words and topics on a topic-assigned corpus.
One issue in these previous works is that they assign the same weight to every sense of a word. The central assumption of our work is that each sense of a word, given a context, should correspond to a mixture of weights reflecting the different degrees of association of the word with multiple senses in the context. The mixture weights help to model word meaning better.
In this paper, we propose a new model for learning Multi-Sense Word Embeddings (mswe). Our mswe model learns vector representations of a word based on a mixture of its sense representations. The key difference between mswe and other models is that we induce the weights of senses while jointly learning the word and sense embeddings. Specifically, we train a topic model (Blei et al., 2003) to obtain the topic-to-word and document-to-topic probability distributions, which are then used to infer the weights of topics. We use these weights to define a compositional vector representation for each target word to predict its context words. mswe thus differs from the topic-based models (Cheng et al., 2015; Liu et al., 2015b, a; Zhang and Zhong, 2016) in that we do not use the topic assignments when jointly learning vector representations of words and topics. We not only learn vectors based on the most suitable topic of a word given its context, but also take into consideration all possible meanings of the word.
The main contributions of our study are: (i) We introduce a mixture model for learning word and sense embeddings (mswe) by inducing mixture weights of word senses. (ii) We show that mswe performs better than the baseline Word2Vec Skip-gram and other embedding models on the word analogy task (Mikolov et al., 2013a) and the word similarity task (Reisinger and Mooney, 2010).
2 The mixture model
In this section, we present the mixture model for learning multi-sense word embeddings. Here we treat topics as senses. The model learns a representation for each word using a mixture of its topical representations.
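Given a number of topics $T$ and a corpus $D$ of documents $d = \{w_{d,1}, w_{d,2}, \ldots, w_{d,M_d}\}$, we apply a topic model (Blei et al., 2003) to obtain the topic-to-word distribution $\Pr(w \mid t)$ and the document-to-topic distribution $\Pr(t \mid d)$. We then infer a weight for the $m$-th word $w_{d,m}$ with topic $t$ in document $d$:
$$\lambda_{d,m,t} = \Pr(w_{d,m} \mid t) \times \Pr(t \mid d) \tag{1}$$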
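As a rough illustration of how such weights could be computed, the sketch below multiplies a topic-to-word probability by a document-to-topic probability for each topic. It assumes the weight is the plain product of the two distributions; the function name `mixture_weights` and all probability values are hypothetical and chosen only for the example.

```python
def mixture_weights(word_given_topic, topic_given_doc):
    """Return one weight per topic: Pr(w | t) * Pr(t | d).

    word_given_topic[t] is Pr(w | t) for a fixed word w,
    topic_given_doc[t] is Pr(t | d) for a fixed document d.
    """
    if len(word_given_topic) != len(topic_given_doc):
        raise ValueError("both inputs need one entry per topic")
    return [p_w * p_t for p_w, p_t in zip(word_given_topic, topic_given_doc)]


# Hypothetical numbers for the word "bank" with T = 3 topics.
pr_bank_given_topic = [0.020, 0.001, 0.008]  # Pr("bank" | t)
pr_topic_given_doc = [0.70, 0.20, 0.10]      # Pr(t | d)

weights = mixture_weights(pr_bank_given_topic, pr_topic_given_doc)
# Index of the topic with the largest weight for this word in this document.
best_topic = max(range(len(weights)), key=weights.__getitem__)
```

Here the first topic dominates because it is both likely in the document and likely to generate the word, which is the kind of context dependence the weights are meant to capture.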
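We define two mswe variants: mswe-1 learns vectors for words based on the most suitable topic given document $d$, while mswe-2 marginalizes over all senses of a word to take into account all possible senses of the word:
$$\text{mswe-1:}\quad s_{w_{d,m}} = \frac{v_{w_{d,m}} + \lambda_{d,m,t'} \times v_{t'}}{1 + \lambda_{d,m,t'}}$$
$$\text{mswe-2:}\quad s_{w_{d,m}} = \frac{v_{w_{d,m}} + \sum_{t=1}^{T} \lambda_{d,m,t} \times v_{t}}{1 + \sum_{t=1}^{T} \lambda_{d,m,t}}$$
where $s_{w_{d,m}}$ is the compositional vector representation of the $m$-th word $w_{d,m}$ and the topics in document $d$; $v_w$ is the target vector representation of a word type $w$ in vocabulary $V$; $v_t$ is the vector representation of topic $t$; $T$ is the number of topics; $\lambda_{d,m,t}$ is defined as in Equation 1; and in mswe-1 we define $t' = \arg\max_{t} \lambda_{d,m,t}$.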
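The two variants can be sketched in a few lines. This is a minimal illustration that assumes the compositional vector is a weighted sum of the word vector and the topic vectors, normalized by one plus the sum of the weights used; the function name `compose` and the toy vectors are hypothetical.

```python
def compose(word_vec, topic_vecs, weights, variant="mswe-2"):
    """Combine a word vector with topic vectors using mixture weights.

    mswe-1 uses only the topic with the largest weight;
    mswe-2 sums over all topics.
    """
    if variant == "mswe-1":
        active = [max(range(len(weights)), key=weights.__getitem__)]
    elif variant == "mswe-2":
        active = list(range(len(weights)))
    else:
        raise ValueError("variant must be 'mswe-1' or 'mswe-2'")
    total = sum(weights[t] for t in active)
    return [
        (word_vec[i] + sum(weights[t] * topic_vecs[t][i] for t in active))
        / (1.0 + total)
        for i in range(len(word_vec))
    ]


# Toy 2-dimensional example with two topics (values are made up).
word_vec = [1.0, 0.0]
topic_vecs = [[0.0, 1.0], [1.0, 1.0]]
weights = [0.5, 0.25]

s1 = compose(word_vec, topic_vecs, weights, "mswe-1")
s2 = compose(word_vec, topic_vecs, weights, "mswe-2")
# With all weights at zero the word vector is returned unchanged.
s0 = compose(word_vec, topic_vecs, [0.0, 0.0], "mswe-2")
```

Under these assumptions, setting every weight to zero returns the plain word vector, which matches the statement that the Skip-gram model is a special case.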
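We learn representations by minimizing the following negative log-likelihood function:
$$\mathcal{L} = -\sum_{d \in D} \sum_{m=1}^{M_d} \sum_{\substack{-k \le j \le k \\ j \ne 0}} \log \Pr(\tilde{v}_{w_{d,m+j}} \mid s_{w_{d,m}}) \tag{2}$$
where the $m$-th word $w_{d,m}$ in document $d$ is a target word, the $(m+j)$-th word $w_{d,m+j}$ in document $d$ is a context word of $w_{d,m}$, and $k$ is the context size. In addition, $\tilde{v}_w$ is the context vector representation of the word type $w$. The probability $\Pr(\tilde{v}_{w_{d,m+j}} \mid s_{w_{d,m}})$ is defined using the softmax function as follows:
$$\Pr(\tilde{v}_{w_{d,m+j}} \mid s_{w_{d,m}}) = \frac{\exp(\tilde{v}_{w_{d,m+j}}^{\top} s_{w_{d,m}})}{\sum_{c' \in V} \exp(\tilde{v}_{c'}^{\top} s_{w_{d,m}})}$$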
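Since computing $\log \Pr(\tilde{v}_{w_{d,m+j}} \mid s_{w_{d,m}})$ is expensive for each training instance, we approximate $\log \Pr(\tilde{v}_{w_{d,m+j}} \mid s_{w_{d,m}})$ in Equation 2 with the following negative-sampling objective (Mikolov et al., 2013b):
$$\mathcal{O}_{d,m,m+j} = \log \sigma(\tilde{v}_{w_{d,m+j}}^{\top} s_{w_{d,m}}) + \sum_{i=1}^{K} \log \sigma(-\tilde{v}_{c_i}^{\top} s_{w_{d,m}}) \tag{3}$$
where $\sigma$ is the logistic sigmoid function and each word $c_i$ is sampled from a noise distribution. (We use a unigram distribution raised to the 3/4 power (Mikolov et al., 2013b) as the noise distribution.) In fact, mswe can be viewed as a generalization of the well-known Word2Vec Skip-gram model with negative sampling (Mikolov et al., 2013b): Skip-gram is recovered when all the mixture weights $\lambda_{d,m,t}$ are set to zero. The models are trained using Stochastic Gradient Descent (SGD).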
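The negative-sampling objective for one (target, context) pair and the smoothed unigram noise distribution can be sketched as follows. This is an illustrative implementation of the standard formulation from Mikolov et al. (2013b), not the authors' code; the function names and the toy counts are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def log_sigmoid(x):
    # Numerically stable log(1 / (1 + exp(-x))).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def negative_sampling_objective(s, context_vec, noise_vecs):
    """Objective for one target representation s and one context word.

    s is the (compositional) target vector, context_vec is the context
    vector of the observed word, noise_vecs are context vectors of the
    sampled noise words. Training maximizes this quantity.
    """
    positive = log_sigmoid(dot(context_vec, s))
    negative = sum(log_sigmoid(-dot(c, s)) for c in noise_vecs)
    return positive + negative


def noise_distribution(unigram_counts, power=0.75):
    """Unigram distribution raised to the given power, renormalized."""
    scaled = {w: c ** power for w, c in unigram_counts.items()}
    z = sum(scaled.values())
    return {w: v / z for w, v in scaled.items()}


# Toy check: with a zero target vector every term equals log(0.5).
value = negative_sampling_objective([0.0, 0.0], [1.0, 2.0], [[3.0, 1.0], [0.5, 0.5]])
noise = noise_distribution({"the": 16, "bank": 1})
```

Raising counts to the 3/4 power flattens the distribution, so rare words are sampled as negatives more often than their raw frequency would suggest.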
3 Experiments
We evaluate mswe on two different tasks: word similarity and word analogy. We also provide experimental results obtained by the baseline Word2Vec Skip-gram model and other previous works.
Note that not all previous results are included in this paper for comparison, because the training corpora used in most previous work are much larger than ours (Baroni et al., 2014; Li and Jurafsky, 2015; Schwartz et al., 2015; Levy et al., 2015). There are also differences in the pre-processing steps that could affect the results. We could improve our results by using a larger training corpus, but this is not the central point of our paper. The objective of our paper is to show that the embeddings of topics and words can be combined into a single mixture model, leading to good improvements as established empirically.
3.1 Experimental Setup
Following Huang et al. (2012) and Neelakantan et al. (2014), we use the Westbury Lab Wikipedia corpus (Shaoul and Westbury, 2010), containing over 2M articles with about 990M words, for training. In the preprocessing step, texts are lowercased and tokenized, numbers are mapped to 0, and punctuation marks are removed. We extract a vocabulary of the 200,000 most frequent word tokens from the pre-processed corpus. Words not occurring in the vocabulary are mapped to a special token unk, and we use the embedding of unk for unknown words in the benchmark datasets.
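The preprocessing steps described above could look roughly like the sketch below. The exact tokenizer and regular expressions used by the authors are not given, so the patterns here are assumptions, and the function names are hypothetical.

```python
import re
from collections import Counter

UNK = "unk"


def preprocess(text):
    """Lowercase, map numbers to 0, remove punctuation, split on whitespace."""
    text = text.lower()
    text = re.sub(r"\d+(?:[.,]\d+)*", "0", text)  # every number becomes "0"
    text = re.sub(r"[^\w\s]", " ", text)          # punctuation becomes a space
    return text.split()


def build_vocab(docs, size):
    """Keep the `size` most frequent tokens over all tokenized documents."""
    counts = Counter(tok for doc in docs for tok in doc)
    return {tok for tok, _ in counts.most_common(size)}


def map_unknown(tokens, vocab):
    """Replace out-of-vocabulary tokens with the special token UNK."""
    return [tok if tok in vocab else UNK for tok in tokens]


docs = [
    preprocess("The bank raised rates by 3.5% in 2010."),
    preprocess("The river bank flooded."),
]
vocab = build_vocab(docs, size=3)  # the paper uses 200,000; 3 is for the toy corpus
mapped = map_unknown(docs[1], vocab)
```

On the toy corpus, the three most frequent tokens are kept and the remaining words in the second sentence are mapped to the unknown token.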
We first use a small subset extracted from the ws353 dataset (Finkelstein et al., 2002) to tune the hyper-parameters of the baseline Word2Vec Skip-gram model for the word similarity task (see Section 3.2 for the task definition). We then directly use the tuned hyper-parameters for our mswe variants. Vector size is also a hyper-parameter. While some approaches use a higher number of dimensions to obtain better results, we fix the vector size to 300, as used by the baseline, for a fair comparison. The vanilla Latent Dirichlet Allocation (LDA) topic model (Blei et al., 2003) does not scale to a very large corpus, so we explore faster online topic models developed for large corpora. We train the online LDA topic model (Hoffman et al., 2010) on the training corpus, and use the output of this topic model to compute the mixture weights as in Equation 1. (We use default parameters in gensim (Řehůřek and Sojka, 2010) for the online LDA model.) We also use the same ws353 subset to tune the number of topics $T$. We find that the most suitable numbers are $T = 50$ and $T = 200$, which are then used for all our experiments. Here we learn 300-dimensional embeddings with the context size $k$ (in Equation 2) and the number of negative samples $K$ (in Equation 3) fixed to the values used by the baseline. During training, we randomly initialize model parameters (i.e. word and topic embeddings) and then learn them using SGD with an initial learning rate of 0.01.
3.2 Word Similarity
The word similarity task evaluates the quality of word embedding models (Reisinger and Mooney, 2010). For a given dataset of word pairs, the evaluation is done by calculating the correlation between the similarity scores of the corresponding word embedding pairs and the human judgment scores. A higher Spearman’s rank correlation ($\rho$) reflects a better word embedding model. We evaluate mswe on standard datasets (as given in Table 1) for the word similarity evaluation task.
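To make the evaluation metric concrete, the sketch below computes Spearman's rank correlation as the Pearson correlation of average ranks, which handles tied scores. It is a generic illustration rather than the evaluation script used in the paper; the scores in the example are made up.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical human judgments and model similarities for four word pairs.
human_scores = [9.0, 7.5, 2.0, 0.5]
model_scores = [0.8, 0.9, 0.3, 0.1]
rho = spearman(human_scores, model_scores)
```

Only the ordering of the scores matters, so the model is not penalized for producing similarities on a different scale from the human ratings.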
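Following Reisinger and Mooney (2010), Huang et al. (2012) and Neelakantan et al. (2014), we compute the similarity scores for a pair of words $(w, w')$ with or without their respective contexts $(\zeta, \zeta')$ as:
$$\mathrm{GlobalSim}(w, w') = \cos(v_w, v_{w'})$$
$$\mathrm{AvgSim}(w, w') = \frac{1}{T^2} \sum_{t=1}^{T} \sum_{t'=1}^{T} \cos(v_{w,t}, v_{w',t'})$$
$$\mathrm{AvgSimC}(w, w') = \frac{1}{T^2} \sum_{t=1}^{T} \sum_{t'=1}^{T} \delta(v_{w,t}, v_{\zeta}) \times \delta(v_{w',t'}, v_{\zeta'}) \times \cos(v_{w,t}, v_{w',t'})$$
where $v_w$ is the vector representation of the word $w$, $v_{w,t}$ is the multiple representation of the word $w$ and the topic $t$, and $v_{\zeta}$ is the vector representation of the context $\zeta$ of the word $w$. $\cos(v, v')$ is the cosine similarity between two vectors $v$ and $v'$. For our experiments, we set $v_{w,t} = v_w \oplus v_t$ and $v_{\zeta} = \sum_{t=1}^{T} \Pr(t \mid \zeta) \times v_{w,t}$, in which $\oplus$ is the concatenation operation and $\Pr(t \mid \zeta)$ is inferred from the topic models by considering context $\zeta$ as a document. GlobalSim only regards word embeddings, while AvgSim considers multiple representations to capture different meanings (i.e. topics) and usages of a word. AvgSimC generalizes AvgSim by taking into account the likelihood $\delta(v_{w,t}, v_{\zeta})$ that word $w$ takes topic $t$ given context $\zeta$. $\delta(v, v')$ is the inverse of the cosine distance from $v$ to $v'$ (Huang et al., 2012; Neelakantan et al., 2014).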
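The three similarity measures can be illustrated with a small sketch. It follows the general formulation of Huang et al. (2012) and Neelakantan et al. (2014) and assumes that each sense vector is the concatenation of a word vector and a topic vector; the context likelihoods are passed in as plain numbers rather than derived from context vectors. All names and values are hypothetical.

```python
import math


def cos(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den


def global_sim(v_w, v_w2):
    """Similarity of the two word vectors alone."""
    return cos(v_w, v_w2)


def sense_vectors(v_w, topic_vecs):
    """One vector per topic: the word vector concatenated with the topic vector."""
    return [v_w + v_t for v_t in topic_vecs]  # list concatenation


def avg_sim(senses_w, senses_w2):
    """Average cosine similarity over all pairs of sense vectors."""
    total = sum(cos(a, b) for a in senses_w for b in senses_w2)
    return total / (len(senses_w) * len(senses_w2))


def avg_sim_c(senses_w, senses_w2, like_w, like_w2):
    """Like avg_sim, but each pair is scaled by the two context likelihoods."""
    total = sum(
        like_w[i] * like_w2[j] * cos(a, b)
        for i, a in enumerate(senses_w)
        for j, b in enumerate(senses_w2)
    )
    return total / (len(senses_w) * len(senses_w2))


# Toy example: orthogonal word vectors that share two topic vectors.
v_w, v_w2 = [1.0, 0.0], [0.0, 1.0]
topics = [[1.0, 0.0], [0.0, 1.0]]
senses_w = sense_vectors(v_w, topics)
senses_w2 = sense_vectors(v_w2, topics)

g = global_sim(v_w, v_w2)
a = avg_sim(senses_w, senses_w2)
c = avg_sim_c(senses_w, senses_w2, [1.0, 1.0], [1.0, 1.0])
```

In this toy case the word vectors alone are orthogonal, but the shared topic components give a non-zero average over sense pairs, which shows how the multi-sense measures can differ from the single-vector one.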
3.2.1 Results for word similarity
Table 2 compares the evaluation results of mswe with results reported in prior work on the standard word similarity task when using GlobalSim. We use subscripts 50 and 200 to denote the topic model trained with $T = 50$ and $T = 200$ topics, respectively. Table 2 shows that our model outperforms the baseline Word2Vec Skip-gram model (in the fifth row from the bottom). Specifically, on the rw dataset, mswe obtains a significant improvement in the Spearman’s rank correlation (about 8.5% relative improvement).
Compared to the published results, mswe obtains the highest scores on the rw, scws, ws353 and men datasets, and achieves the second highest result on the SimLex dataset. These results indicate that mswe learns better representations for words by taking into account different meanings.
3.2.2 Results for contextual word similarity
We evaluate our model mswe using AvgSim and AvgSimC on the benchmark scws dataset, which considers the effect of contextual information on the word similarity task. As shown in Table 3, mswe scores better than the closely related model proposed by Cheng et al. (2015) and generally obtains good results on this context-sensitive dataset. Although we produce better scores than Neelakantan et al. (2014) and Chen et al. (2014) when using GlobalSim, we are outperformed by them when using AvgSim and AvgSimC. Neelakantan et al. (2014) clustered the embeddings of the context words around each target word to predict its sense, and Chen et al. (2014) used pre-trained word embeddings to initialize vector representations of senses taken from WordNet, while we use a fixed number of topics as senses for words in mswe.
3.3 Word Analogy
We evaluate the embedding models on the word analogy task introduced by Mikolov et al. (2013a). The task aims to answer questions of the form “$a$ is to $b$ as $c$ is to _”, denoted as “$a : b :: c : ?$” (e.g., “Hanoi : Vietnam :: Bern : ?”). There are 8,869 semantic and 10,675 syntactic questions grouped into 14 categories. Each question is answered by finding the word whose embedding is closest to “$v_b - v_a + v_c$”, measured by the cosine similarity. The answer is correct only if the closest word found is exactly the same as the gold-standard (correct) one for the question.
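The analogy procedure can be sketched as a nearest-neighbor search around an offset vector. This toy version excludes the three question words from the candidates, which is common practice but an assumption here; the embeddings are made up and the function names are hypothetical.

```python
import math


def cos(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den


def answer_analogy(a, b, c, embeddings):
    """Return the word closest (by cosine) to v_b - v_a + v_c."""
    target = [
        vb - va + vc
        for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])
    ]
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cos(target, embeddings[w]))


def analogy_accuracy(questions, embeddings):
    """Fraction of (a, b, c, gold) questions answered exactly right."""
    correct = sum(
        answer_analogy(a, b, c, embeddings) == gold for a, b, c, gold in questions
    )
    return correct / len(questions)


# Made-up 3-dimensional embeddings for illustration only.
embeddings = {
    "hanoi": [1.0, 0.0, 0.0],
    "vietnam": [1.0, 1.0, 0.0],
    "bern": [0.0, 0.0, 1.0],
    "switzerland": [0.0, 1.0, 1.0],
    "paris": [0.5, 0.0, 0.5],
}
prediction = answer_analogy("hanoi", "vietnam", "bern", embeddings)
accuracy = analogy_accuracy([("hanoi", "vietnam", "bern", "switzerland")], embeddings)
```

Because only an exact match counts, a prediction that is a close synonym of the gold answer is still scored as wrong.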
We report accuracies in Table 4 and show that mswe achieves better results than the baseline Word2Vec Skip-gram. In particular, mswe reaches accuracies of around 69.7, which is higher than the accuracy of 68.6 obtained by Word2Vec Skip-gram.
4 Conclusions
In this paper, we described a mixture model for learning multi-sense embeddings. Our model induces mixture weights to represent a word in context as a mixture of its sense representations. The results show that our model scores better than Word2Vec and produces highly competitive results on the standard evaluation tasks. In future work, we will explore better methods for taking contextual information into account. We also plan to explore different approaches to computing the mixture weights in our model. For example, if a large sense-annotated corpus is available for training, the mixture weights could be defined based on the frequency (sense-count) distributions, instead of the probability distributions produced by a topic model. Furthermore, the weights of senses could be treated as additional model parameters to be learned during training.
Acknowledgments
This research was funded by the German Research Foundation (DFG) as part of SFB 1102 “Information Density and Linguistic Encoding”. We would like to thank the anonymous reviewers for their helpful comments.
References
- Marco Baroni, Georgiana Dinu, and Germán Kruszewski. 2014. Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 238–247.
- David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research 3:993–1022.
- Elia Bruni, Nam Khanh Tran, and Marco Baroni. 2014. Multimodal distributional semantics. Journal of Artificial Intelligence Research 49:1–47.
- Tao Chen, Ruifeng Xu, Yulan He, and Xuan Wang. 2015. Improving distributed representation of word sense via WordNet gloss composition and context clustering. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 15–20.
- Xinxiong Chen, Zhiyuan Liu, and Maosong Sun. 2014. A unified model for word sense representation and disambiguation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1025–1035.
- Jianpeng Cheng and Dimitri Kartsaklis. 2015. Syntax-aware multi-sense word embeddings for deep compositional models of meaning. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1531–1542.
- Jianpeng Cheng, Zhongyuan Wang, Ji-Rong Wen, Jun Yan, and Zheng Chen. 2015. Contextual text understanding in distributional semantic space. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pages 133–142.
- Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2002. Placing search in context: The concept revisited. ACM Transactions on Information Systems 20:116–131.
