TL;DR
This paper demonstrates that using sense-level embeddings derived from contextual language models and WordNet can significantly improve Word Sense Disambiguation, surpassing complex neural models with a simple nearest neighbor approach.
Contribution
It introduces a method to create full-coverage sense embeddings from contextual models without task-specific tuning, enabling effective WSD with simple algorithms.
Findings
Sense embeddings outperform previous neural models in WSD tasks.
Robustness analysis reveals limitations when ignoring POS and lemma features.
Sense embeddings facilitate concept-level analysis of language models.
| Source | Coverage | BERT: F1 / P / R (without MFS) | ELMo: F1 / P / R (without MFS) |
|---|---|---|---|
| SemCor | 16.11% | 68.9 / 72.4 / 65.7 | 63.0 / 66.2 / 60.1 |
| + synset | 26.97% | 70.0 / 72.6 / 70.0 | 63.9 / 66.3 / 61.7 |
| + hypernym | 74.70% | 73.0 / 73.6 / 72.4 | 67.2 / 67.7 / 66.6 |
| + lexname | 100% | 73.8 / 73.8 / 73.8 | 68.1 / 68.1 / 68.1 |
| Configurations | LMMS1024 | LMMS1024 | LMMS1024 | LMMS2048 | | | LMMS2348 |
|---|---|---|---|---|---|---|---|
| Embeddings | | | | | | | |
| Contextual (d=1024) | ✓ | | ✓ | ✓ | ✓ | | ✓ |
| Dictionary (d=1024) | | ✓ | ✓ | ✓ | | ✓ | ✓ |
| Static (d=300) | | | | | ✓ | ✓ | ✓ |
| Operation | | | | | | | |
| Average | | | ✓ | | | | |
| Concatenation | | | | ✓ | ✓ | ✓ | ✓ |
| Perf. (F1 on ALL) | | | | | | | |
| Lemma & POS | 73.8 | 58.7 | 75.0 | 75.4 | 73.9 | 58.7 | 75.4 |
| Token (Uninformed) | 42.7 | 6.1 | 36.5 | 35.1 | 64.4 | 45.0 | 66.0 |
| Model | Senseval2 | Senseval3 | SemEval2007 | SemEval2013 | SemEval2015 | ALL |
|---|---|---|---|---|---|---|
| | (n=2,282) | (n=1,850) | (n=455) | (n=1,644) | (n=1,022) | (n=7,253) |
| MFS† (Most Frequent Sense) | 65.6 | 66.0 | 54.5 | 63.8 | 67.1 | 64.8 |
| IMS† (2010) | 70.9 | 69.3 | 61.3 | 65.3 | 69.5 | 68.4 |
| IMS + embeddings† (2016) | 72.2 | 70.4 | 62.6 | 65.9 | 71.5 | 69.6 |
| context2vec k-NN† (2016) | 71.8 | 69.1 | 61.3 | 65.6 | 71.9 | 69.0 |
| word2vec k-NN (2016) | 67.8 | 62.1 | 58.5 | 66.1 | 66.7 | - |
| LSTM-LP (Label Prop.) (2016) | 73.8 | 71.8 | 63.5 | 69.5 | 72.6 | - |
| Seq2Seq (Task Modelling) (2017b) | 70.1 | 68.5 | 63.1* | 66.5 | 69.2 | 68.6* |
| BiLSTM (Task Modelling) (2017b) | 72.0 | 69.1 | 64.8* | 66.9 | 71.5 | 69.9* |
| ELMo k-NN (2018) | 71.5 | 67.5 | 57.1 | 65.3 | 69.9 | 67.9 |
| HCAN (Hier. Co-Attention) (2018a) | 72.8 | 70.3 | -* | 68.5 | 72.8 | -* |
| BiLSTM w/Vocab. Reduction (2018) | 72.6 | 70.4 | 61.5 | 70.8 | 71.3 | 70.8 |
| BERT k-NN | 76.3 | 73.2 | 66.2 | 71.7 | 74.1 | 73.5 |
| LMMS2348 (ELMo) | 68.1 | 64.7 | 53.8 | 66.9 | 69.0 | 66.2 |
| LMMS2348 (BERT) | 76.3 | 75.6 | 68.1 | 75.1 | 77.0 | 75.4 |
| WN-POS | NOUN | VERB | ADJ | ADV |
|---|---|---|---|---|
| NOUN | 96.95% | 1.86% | 0.86% | 0.33% |
| VERB | 9.08% | 70.82% | 19.98% | 0.12% |
| ADJ | 4.50% | 0% | 92.27% | 2.93% |
| ADV | 2.02% | 0.29% | 2.60% | 95.09% |
| Sentence (target token: "played") | Glosses of the senses matched for "played" |
|---|---|
| Marlon Brando played Corleone in Godfather | play a role or part; represent fictiously, as in a play, or pretend to be or act like; give expression or emotion to, in a stage or movie role. |
| Serena Williams played Kerber in Wimbledon | participate in games or sport; take one's position before a kick-off; play the Scottish game of curling. |
| David Bowie played Warszawa in Tokyo | perform on a certain location; replay (as a melody); play riffs. |
Language Modelling Makes Sense: Propagating Representations through WordNet for Full-Coverage Word Sense Disambiguation
Daniel Loureiro, Alípio Mário Jorge
LIAAD - INESC TEC
Faculty of Sciences - University of Porto, Portugal
Abstract
Contextual embeddings represent a new generation of semantic representations learned from Neural Language Modelling (NLM) that addresses the issue of meaning conflation hampering traditional word embeddings. In this work, we show that contextual embeddings can be used to achieve unprecedented gains in Word Sense Disambiguation (WSD) tasks. Our approach focuses on creating sense-level embeddings with full-coverage of WordNet, and without recourse to explicit knowledge of sense distributions or task-specific modelling. As a result, a simple Nearest Neighbors (k-NN) method using our representations is able to consistently surpass the performance of previous systems using powerful neural sequencing models. We also analyse the robustness of our approach when ignoring part-of-speech and lemma features, requiring disambiguation against the full sense inventory, and revealing shortcomings to be improved. Finally, we explore applications of our sense embeddings for concept-level analyses of contextual embeddings and their respective NLMs.
1 Introduction
Word Sense Disambiguation (WSD) is a core task of Natural Language Processing (NLP) which consists in assigning the correct sense to a word in a given context, and has many potential applications Navigli (2009). Despite breakthroughs in distributed semantic representations (i.e. word embeddings), resolving lexical ambiguity has remained a long-standing challenge in the field. Systems using non-distributional features, such as It Makes Sense (IMS, Zhong and Ng, 2010), remain surprisingly competitive against neural sequence models trained end-to-end. A baseline that simply chooses the most frequent sense (MFS) has also proven to be notoriously difficult to surpass.
Several factors have contributed to this limited progress over the last decade, including lack of standardized evaluation, and restricted amounts of sense annotated corpora. Addressing the evaluation issue, Raganato et al. (2017a) introduced a unified evaluation framework that has already been adopted by the latest works in WSD. Also, even though SemCor Miller et al. (1994) still remains the largest manually annotated corpus, supervised methods have successfully used label propagation Yuan et al. (2016), semantic networks Vial et al. (2018) and glosses Luo et al. (2018b) in combination with annotations to advance the state-of-the-art. Meanwhile, task-specific sequence modelling architectures based on BiLSTMs or Seq2Seq Raganato et al. (2017b) haven't yet proven as advantageous for WSD.
Until recently, the best semantic representations at our disposal, such as word2vec Mikolov et al. (2013) and fastText Bojanowski et al. (2017), were bound to word types (i.e. distinct tokens), converging information from different senses into the same representations (e.g. "play song" and "play tennis" share the same representation of "play"). These word embeddings were learned from unsupervised Neural Language Modelling (NLM) trained on fixed-length contexts. However, by recasting the same word types across different sense-inducing contexts, these representations became insensitive to the different senses of polysemous words. Camacho-Collados and Pilehvar (2018) refer to this issue as the meaning conflation deficiency and explore it more thoroughly in their work.
Recent improvements to NLM have allowed for learning representations that are context-specific and detached from word types. While word embedding methods reduced NLMs to fixed representations after pretraining, this new generation of contextual embeddings employs the pretrained NLM to infer different representations induced by arbitrarily long contexts. Contextual embeddings have already had a major impact on the field, driving progress on numerous downstream tasks. This success has also motivated a number of iterations on embedding models in a short timespan, from context2vec Melamud et al. (2016), to GPT Radford et al. (2018), ELMo Peters et al. (2018), and BERT Devlin et al. (2019).
Being context-sensitive by design, contextual embeddings are particularly well-suited for WSD. In fact, Melamud et al. (2016) and Peters et al. (2018) produced contextual embeddings from the SemCor dataset and showed competitive results on Raganato et al. (2017a)'s WSD evaluation framework, with a surprisingly simple approach based on Nearest Neighbors (k-NN). These results were promising, but those works only produced sense embeddings for the small fraction of WordNet Fellbaum (1998) senses covered by SemCor, resorting to the MFS approach for a large number of instances. Lack of high coverage annotations is one of the most pressing issues for supervised WSD approaches Le et al. (2018).
Our experiments show that the simple k-NN w/MFS approach using BERT embeddings suffices to surpass the performance of all previous systems. Most importantly, in this work we introduce a method for generating sense embeddings with full-coverage of WordNet, which further improves results (additional 1.9% F1) while forgoing MFS fallbacks. To better evaluate the fitness of our sense embeddings, we also analyse their performance without access to lemma or part-of-speech features typically used to restrict candidate senses. Representing sense embeddings in the same space as any contextual embeddings generated from the same pretrained NLM eases introspections of those NLMs, and enables token-level intrinsic evaluations based on k-NN WSD performance. We summarize our contributions (code and data: github.com/danlou/lmms) below:
- A method for creating sense embeddings for all senses in WordNet, allowing for WSD based on k-NN without MFS fallbacks.
- Major improvement over the state-of-the-art on cross-domain WSD tasks, while exploring the strengths and weaknesses of our method.
- Applications of our sense embeddings for concept-level analyses of NLMs.
2 Language Modelling Representations
Distributional semantic representations learned from Unsupervised Neural Language Modelling (NLM) are currently used for most NLP tasks. In this section we cover aspects of word and contextual embeddings, learned from NLMs, that are particularly relevant for our work.
2.1 Static Word Embeddings
Word embeddings are distributional semantic representations usually learned from NLM under one of two possible objectives: predict context words given a target word (Skip-Gram), or the inverse (CBOW) (word2vec, Mikolov et al., 2013). In both cases, context corresponds to a fixed-length window sliding over tokenized text, with the target word at the center. These modelling objectives are enough to produce dense vector-based representations of words that are widely used as powerful initializations on neural modelling architectures for NLP. As we explained in the introduction, word embeddings are limited by meaning conflation around word types, and reduce NLM to fixed representations that are insensitive to contexts. However, with fastText Bojanowski et al. (2017) we're not restricted to a finite set of representations and can compositionally derive representations for word types unseen during training.
2.2 Contextual Embeddings
The key differentiation of contextual embeddings is that they are context-sensitive, allowing the same word types to be represented differently according to the contexts in which they occur. In order to be able to produce new representations induced by different contexts, contextual embeddings employ the pretrained NLM for inferences. Also, the NLM objective for contextual embeddings is usually directional, predicting the previous and/or next tokens in arbitrarily long contexts (usually sentences). ELMo Peters et al. (2018) was the first implementation of contextual embeddings to gain wide adoption, but it was followed shortly after by BERT Devlin et al. (2019), which achieved new state-of-the-art results on 11 NLP tasks. Interestingly, BERT's impressive results were obtained from task-specific fine-tuning of pretrained NLMs, instead of using them as features in more complex models, emphasizing the quality of these representations.
3 Word Sense Disambiguation (WSD)
There are several lines of research exploring different approaches for WSD Navigli (2009). Supervised methods have traditionally performed best, though this distinction is becoming increasingly blurred as works in supervised WSD start exploiting resources used by knowledge-based approaches (e.g. Luo et al., 2018a; Vial et al., 2018). We relate our work to the best-performing WSD methods, regardless of approach, as well as methods that may not perform as well but involve producing sense embeddings. In this section we introduce the components and related works that are most relevant for our approach.
3.1 Sense Inventory, Attributes and Relations
The most popular sense inventory is WordNet, a semantic network of general domain concepts linked by a few relations, such as synonymy and hypernymy. WordNet is organized at different abstraction levels, which we describe below. Following the notation used in related works, we represent the main structure of WordNet, called synset, with $lemma^{\#}_{POS}$, where $lemma$ corresponds to the canonical form of a word, $POS$ corresponds to the sense's part-of-speech (noun, verb, adjective or adverb), and $\#$ further specifies this entry.
- Synsets: groups of synonymous words that correspond to the same sense, e.g. $dog^{1}_{n}$.
- Lemmas: canonical forms of words, may belong to multiple synsets, e.g. dog is a lemma for $dog^{1}_{n}$ and $chase^{1}_{v}$, among others.
- Senses: lemmas specified by sense (i.e. sensekeys), e.g. dog%1:05:00::, and domestic_dog%1:05:00:: are senses of $dog^{1}_{n}$.
Each synset has a number of attributes, of which the most relevant for this work are:
- Glosses: dictionary definitions, e.g. $dog^{1}_{n}$ has the definition "a member of the genus Ca…".
- Hypernyms: "type of" relations between synsets, e.g. $canine^{2}_{n}$ is a hypernym of $dog^{1}_{n}$.
- Lexnames: syntactical and logical groupings, e.g. the lexname for $dog^{1}_{n}$ is noun.animal.
In this work we're using WordNet 3.0, which contains 117,659 synsets, 206,949 unique senses, 147,306 lemmas, and 45 lexnames.
3.2 WSD State-of-the-Art
While non-distributional methods, such as Zhong and Ng (2010)'s IMS, still perform competitively, there have been several noteworthy advancements in the last decade using distributional representations from NLMs. Iacobacci et al. (2016) improved on IMS's performance by introducing word embeddings as additional features.
Yuan et al. (2016) achieved significantly improved results by leveraging massive corpora to train a NLM based on an LSTM architecture. This work is contemporaneous with Melamud et al. (2016), and also uses a very similar approach for generating sense embeddings and relying on k-NN w/MFS for predictions. Although most performance gains stemmed from their powerful NLM, they also introduced a label propagation method that further improved results in some cases. Curiously, the objective Yuan et al. (2016) used for NLM (predicting held-out words) is very evocative of the cloze-style Masked Language Model introduced by Devlin et al. (2019). Le et al. (2018) replicated this work and offer additional insights.
Raganato et al. (2017b) trained neural sequencing models for end-to-end WSD. This work reframes WSD as a translation task where sequences of words are translated into sequences of senses. The best result was obtained with a BiLSTM trained with auxiliary losses specific to parts-of-speech and lexnames. Despite the sophisticated modelling architecture, it still performed on par with Iacobacci et al. (2016).
The works of Melamud et al. (2016) and Peters et al. (2018) using contextual embeddings for WSD showed the potential of these representations, but still performed comparably to IMS.
Addressing the issue of scarce annotations, recent works have proposed methods for using resources from knowledge-based approaches. Luo et al. (2018a) and Luo et al. (2018b) combine information from glosses present in WordNet, with NLMs based on BiLSTMs, through memory networks and co-attention mechanisms, respectively. Vial et al. (2018) follows Raganato et al. (2017b)'s BiLSTM method, but leverages the semantic network to strategically reduce the set of senses required for disambiguating words.
All of these works rely on MFS fallback. Additionally, to our knowledge, all also perform disambiguation only against the set of admissible senses given the word's lemma and part-of-speech.
3.3 Other methods with Sense Embeddings
Some works may no longer be competitive with the state-of-the-art, but nevertheless remain relevant for the development of sense embeddings. We recommend the recent survey of Camacho-Collados and Pilehvar (2018) for a thorough overview of this topic, and highlight a few of the most relevant methods. Chen et al. (2014) initializes sense embeddings using glosses and adapts the Skip-Gram objective of word2vec to learn and improve sense embeddings jointly with word embeddings. Rothe and Schütze (2015)'s AutoExtend method uses pretrained word2vec embeddings to compose sense embeddings from sets of synonymous words. Camacho-Collados et al. (2016) creates the NASARI sense embeddings using structural knowledge from large multilingual semantic networks.
These methods represent sense embeddings in the same space as the pretrained word embeddings. However, being based on fixed embedding spaces, they are much more limited in their ability to generate contextual representations to match against. Furthermore, none of these methods (or those in §3.2) achieve full-coverage of the +200K senses in WordNet.
4 Method
Our WSD approach is strictly based on k-NN (see Figure 1), unlike any of the works referred to previously. We avoid relying on MFS for lemmas that do not occur in annotated corpora by generating sense embeddings with full-coverage of WordNet. Our method starts by generating sense embeddings from annotations, as done by other works, and then introduces several enhancements towards full-coverage, better performance and increased robustness. In this section, we cover each of these techniques.
4.1 Embeddings from Annotations
Our set of full-coverage sense embeddings is bootstrapped from sense-annotated corpora. Sentences containing sense-annotated tokens (or spans) are processed by a NLM in order to obtain contextual embeddings for those tokens. After collecting all sense-labeled contextual embeddings, each sense embedding is determined by averaging its corresponding contextual embeddings. Formally, given $n$ contextual embeddings $\vec{c}_1, \dots, \vec{c}_n$ for some sense $s$:

$$\vec{v}_s = \frac{1}{n} \sum_{i=1}^{n} \vec{c}_i, \qquad \dim(\vec{v}_s) = 1024$$
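The averaging step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `average_sense_embeddings`, the toy 3-dimensional vectors, and the second sense label `toy_sense_b` are assumptions made for the example (real contextual embeddings here would have 1024 dimensions).

```python
import numpy as np

def average_sense_embeddings(labelled_vectors):
    """Average contextual vectors per sense label.

    labelled_vectors: iterable of (sense_key, vector) pairs,
    one pair per sense-annotated token occurrence.
    """
    sums, counts = {}, {}
    for sense, vec in labelled_vectors:
        vec = np.asarray(vec, dtype=np.float64)
        if sense in sums:
            sums[sense] += vec
            counts[sense] += 1
        else:
            sums[sense] = vec.copy()
            counts[sense] = 1
    return {sense: sums[sense] / counts[sense] for sense in sums}

# Toy data: two occurrences of one sense, one occurrence of a hypothetical other sense.
toy = [
    ("dog%1:05:00::", [1.0, 0.0, 2.0]),
    ("dog%1:05:00::", [3.0, 2.0, 0.0]),
    ("toy_sense_b", [0.0, 1.0, 1.0]),
]
sense_vecs = average_sense_embeddings(toy)
print(sense_vecs["dog%1:05:00::"])  # element-wise mean of the two "dog" vectors
```

Keeping running sums and counts avoids holding every contextual embedding in memory, which may matter when a corpus yields many occurrences per sense.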
In this work we use pretrained ELMo and BERT models to generate contextual embeddings. These models can be identified and replicated with the following details:
- ELMo: 1024 (2x512) embedding dimensions, 93.6M parameters. Embeddings from top layer (2).
- BERT: 1024 embedding dimensions, 340M parameters, cased. Embeddings from sum of top 4 layers ([-1,-4]). This was the configuration that performed best out of the ones in Table 7 of Devlin et al. (2018).
BERT uses WordPiece tokenization that doesn't always map to token-level annotations (e.g. "multiplication" becomes "multi", "##plication"). We use the average of subtoken embeddings as the token-level embedding. Unless specified otherwise, our LMMS method uses BERT.
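As a rough sketch of the two pooling steps mentioned above (summing the top 4 layers, then averaging subtoken vectors into token vectors), assuming the hidden states are already available as a NumPy array. The helper names `sum_top_layers` and `pool_subtokens`, the span format, and the toy shapes are hypothetical and not taken from the paper's code.

```python
import numpy as np

def sum_top_layers(layer_outputs, n_layers=4):
    """Sum the last n_layers hidden layers.

    layer_outputs: array of shape (num_layers, num_subtokens, dim).
    """
    layer_outputs = np.asarray(layer_outputs, dtype=np.float64)
    return layer_outputs[-n_layers:].sum(axis=0)

def pool_subtokens(subtoken_vecs, token_spans):
    """Average subtoken vectors into one vector per original token.

    token_spans: list of (start, end) subtoken index ranges, end exclusive.
    """
    subtoken_vecs = np.asarray(subtoken_vecs, dtype=np.float64)
    return np.stack([subtoken_vecs[start:end].mean(axis=0)
                     for start, end in token_spans])

# Toy input: 5 layers, 3 subtokens ("multi", "##plication", plus one more token), dim 2.
layers = np.ones((5, 3, 2))
layers[:, 1, :] = 3.0                      # make the second subtoken differ
subtokens = sum_top_layers(layers)         # shape (3, 2)
tokens = pool_subtokens(subtokens, [(0, 2), (2, 3)])  # shape (2, 2)
print(tokens)
```

The first token vector is the mean of its two subtoken vectors, while the single-subtoken token passes through unchanged.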
4.2 Extending Annotation Coverage
As many have emphasized before Navigli (2009); Camacho-Collados and Pilehvar (2018); Le et al. (2018), the lack of sense annotations is a major limitation of supervised approaches for WSD. We address this issue by taking advantage of the semantic relations in WordNet to extend the annotated signal to other senses. Semantic networks are often explored by knowledge-based approaches, and some recent works in supervised approaches as well Luo et al. (2018a); Vial et al. (2018). The guiding principle behind these approaches is that sense-level representations can be imputed (or improved) from other representations that are known to correspond to generalizations due to the network's taxonomical structure. Vial et al. (2018) leverages relations in WordNet to reduce the sense inventory to a minimal set of entries, making the task easier to model while maintaining the ability to distinguish senses. We take the inverse path of leveraging relations to produce representations for additional senses.
On §3.1 we covered synsets, hypernyms and lexnames, which correspond to increasingly abstract generalizations. Missing sense embeddings are imputed from the aggregation of sense embeddings at each of these abstraction levels. In order to get embeddings that are representative of higher-level abstractions, we simply average the embeddings of all lower-level constituents. Thus, a synset embedding corresponds to the average of all of its sense embeddings, a hypernym embedding corresponds to the average of all of its synset embeddings, and a lexname embedding corresponds to the average of a larger set of synset embeddings. All lower abstraction representations are created before next-level abstractions to ensure that higher abstractions make use of lower generalizations. More formally, given all missing senses in WordNet $\hat{s} \in W$, their synset-specific sense embeddings $S_{\hat{s}}$, hypernym-specific synset embeddings $H_{\hat{s}}$, and lexname-specific synset embeddings $L_{\hat{s}}$, the procedure has the following stages:

(1) If $|S_{\hat{s}}| > 0$: $\vec{v}_{\hat{s}} = \frac{1}{|S_{\hat{s}}|} \sum_{\vec{v}_s \in S_{\hat{s}}} \vec{v}_s$

(2) If $|H_{\hat{s}}| > 0$: $\vec{v}_{\hat{s}} = \frac{1}{|H_{\hat{s}}|} \sum_{\vec{v}_{syn} \in H_{\hat{s}}} \vec{v}_{syn}$

(3) If $|L_{\hat{s}}| > 0$: $\vec{v}_{\hat{s}} = \frac{1}{|L_{\hat{s}}|} \sum_{\vec{v}_{syn} \in L_{\hat{s}}} \vec{v}_{syn}$
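The staged imputation described above (synset first, then hypernym, then lexname) could be sketched as follows. This is an illustrative simplification under stated assumptions: the WordNet structure is passed in as plain dictionaries rather than read from a WordNet API, each synset is given a single hypernym, and all names (`extend_coverage`, `synset_of`, `hypernym_of`, `lexname_of`, the toy sense labels) are hypothetical.

```python
import numpy as np

def _mean(vectors):
    return np.mean(np.stack(vectors), axis=0)

def _synset_vectors(sense_vecs, synset_of):
    """Synset vector = average of the vectors of its covered senses."""
    grouped = {}
    for sense, vec in sense_vecs.items():
        grouped.setdefault(synset_of[sense], []).append(vec)
    return {syn: _mean(vs) for syn, vs in grouped.items()}

def _group_vectors(syn_vecs, group_of):
    """Hypernym or lexname vector = average of its covered synset vectors."""
    grouped = {}
    for syn, vec in syn_vecs.items():
        group = group_of.get(syn)
        if group is not None:
            grouped.setdefault(group, []).append(vec)
    return {g: _mean(vs) for g, vs in grouped.items()}

def extend_coverage(all_senses, sense_vecs, synset_of, hypernym_of, lexname_of):
    """Impute vectors for senses missing from sense_vecs, one stage at a time."""
    vecs = {s: np.asarray(v, dtype=np.float64) for s, v in sense_vecs.items()}
    # Stage 1: a missing sense inherits the average of its synset's covered senses.
    syn_vecs = _synset_vectors(vecs, synset_of)
    for s in all_senses:
        if s not in vecs and synset_of[s] in syn_vecs:
            vecs[s] = syn_vecs[synset_of[s]]
    # Stages 2 and 3: fall back to hypernym-level, then lexname-level averages.
    # Synset vectors are recomputed so each stage builds on the previous one.
    for group_of in (hypernym_of, lexname_of):
        syn_vecs = _synset_vectors(vecs, synset_of)
        group_vecs = _group_vectors(syn_vecs, group_of)
        for s in all_senses:
            group = group_of.get(synset_of[s])
            if s not in vecs and group in group_vecs:
                vecs[s] = group_vecs[group]
    return vecs

# Toy inventory: only "dog" and "cat" are annotated.
synset_of = {"dog": "dog.n.01", "domestic_dog": "dog.n.01",
             "wolf": "wolf.n.01", "cat": "cat.n.01", "bird": "bird.n.01"}
hypernym_of = {"dog.n.01": "canine.n.02", "wolf.n.01": "canine.n.02",
               "cat.n.01": "feline.n.01", "bird.n.01": "vertebrate.n.01"}
lexname_of = {syn: "noun.animal" for syn in hypernym_of}
annotated = {"dog": [2.0, 0.0], "cat": [0.0, 4.0]}
full = extend_coverage(list(synset_of), annotated, synset_of, hypernym_of, lexname_of)
print({s: v.tolist() for s, v in full.items()})
```

In the toy run, `domestic_dog` is filled at the synset stage, `wolf` at the hypernym stage, and `bird` only at the lexname stage, mirroring how coverage grows from one abstraction level to the next.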
In Table 1 we show how much coverage extends while improving both recall and precision.
4.3 Improving Senses using the Dictionary
There's a long tradition of using glosses for WSD, perhaps starting with the popular work of Lesk (1986), which has since been adapted to use distributional representations Basile et al. (2014). As a sequence of words, the information contained in glosses can be easily represented in semantic spaces through approaches used for generating sentence embeddings. There are many methods for generating sentence embeddings, but it's been shown that a simple weighted average of word embeddings performs well Arora et al. (2017).
Our contextual embeddings are produced from NLMs using attention mechanisms, assigning more importance to some tokens over others, so they already come "pre-weighted" and we embed glosses simply as the average of all of their contextual embeddings (without preprocessing). We've also found that introducing synset lemmas alongside the words in the gloss helps induce better contextualized embeddings (especially when glosses are short). Finally, we make our dictionary embeddings ($\vec{v}_d$) sense-specific, rather than synset-specific, by repeating the lemma that's specific to the sense, alongside the synset's lemmas and gloss words. The result is a sense-level embedding, determined without annotations, that is represented in the same space as the sense embeddings we described in the previous section, and can be trivially combined through concatenation or average for improved performance (see Table 2).
Our empirical results show improved performance by concatenation, which we attribute to preserving complementary information from glosses. Both averaging and concatenating representations (previously normalized) also serve to smooth possible biases that may have been learned from the SemCor annotations. Note that while concatenation effectively doubles the size of our embeddings, this doesn't equal doubling the expressiveness of the distributional space, since they're two representations from the same NLM. This property also allows us to make predictions for contextual embeddings (from the same NLM) by simply repeating those embeddings twice, aligning contextual features against sense and dictionary features when computing cosine similarity. Thus, our sense embeddings become:

$$\vec{v}_s = \begin{bmatrix} \vec{v}_s / \lVert \vec{v}_s \rVert_2 \\ \vec{v}_d / \lVert \vec{v}_d \rVert_2 \end{bmatrix}, \qquad \dim(\vec{v}_s) = 2048$$
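A short derivation may help show why repeating the contextual embedding works, under the assumption that both halves of the sense embedding are L2-normalized before concatenation. The symbols $\vec{c}$ (contextual embedding of the token), $\hat{a}$ (unit-length annotation-based half) and $\hat{d}$ (unit-length dictionary half) are introduced here for illustration only.

$$\cos\big([\vec{c};\vec{c}],\,[\hat{a};\hat{d}]\big) = \frac{\vec{c}\cdot\hat{a} + \vec{c}\cdot\hat{d}}{\sqrt{2}\,\lVert\vec{c}\rVert_2 \cdot \sqrt{2}} = \frac{1}{2}\Big(\cos(\vec{c},\hat{a}) + \cos(\vec{c},\hat{d})\Big)$$

Under that assumption, matching a repeated contextual embedding against the concatenated sense embedding amounts to averaging the two individual cosine similarities, so annotation-based and dictionary-based evidence contribute equally to the ranking.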
4.4 Morphological Robustness
WSD is expected to be performed only against the set of candidate senses that are specific to a target word's lemma. However, as we'll explain in §5.3, there are cases where it's undesirable to restrict the WSD process.
We leverage word embeddings specialized for morphological representations to make our sense embeddings more resilient to the absence of lemma features, achieving increased robustness. This addresses a problem arising from the susceptibility of contextual embeddings to become entirely detached from the morphology of their corresponding tokens, due to interactions with other tokens in the sentence.
We choose fastText Bojanowski et al. (2017) embeddings (pretrained on CommonCrawl), which are biased towards morphology, and avoid Out-of-Vocabulary issues as explained in §2.1. We use fastText to generate static word embeddings for the lemmas ($\vec{v}_l$) corresponding to all senses, and concatenate these word embeddings to our previous embeddings. When making predictions, we also compute fastText embeddings for tokens, allowing for the same alignment explained in the previous section. This technique effectively makes sense embeddings of morphologically related lemmas more similar. Empirical results (see Table 2) show that introducing these static embeddings is crucial for achieving satisfactory performance when not filtering candidate senses.
Our final, most robust, sense embeddings are thus:

$$\vec{v}_s = \begin{bmatrix} \vec{v}_s / \lVert \vec{v}_s \rVert_2 \\ \vec{v}_d / \lVert \vec{v}_d \rVert_2 \\ \vec{v}_l / \lVert \vec{v}_l \rVert_2 \end{bmatrix}, \qquad \dim(\vec{v}_s) = 2348$$
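A minimal sketch of how the three blocks might be assembled and matched, assuming each block is L2-normalized before concatenation and that the query repeats the contextual embedding to line up with the annotation and dictionary blocks. The helper names (`unit`, `sense_vector`, `query_vector`, `cosine`) and the toy 2-dimensional blocks are assumptions for illustration; with real inputs the blocks would be 1024, 1024 and 300 dimensions.

```python
import numpy as np

def unit(v):
    """L2-normalize a vector (zero vectors are returned unchanged)."""
    v = np.asarray(v, dtype=np.float64)
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def sense_vector(annotation_vec, gloss_vec, lemma_static_vec):
    """Concatenate normalized annotation, dictionary and static lemma blocks."""
    return np.concatenate([unit(annotation_vec), unit(gloss_vec), unit(lemma_static_vec)])

def query_vector(context_vec, token_static_vec):
    """Repeat the contextual embedding twice, then append the token's static embedding."""
    c = unit(context_vec)
    return np.concatenate([c, c, unit(token_static_vec)])

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy blocks of dimension 2 each.
sense = sense_vector([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
query = query_vector([2.0, 0.0], [0.0, 3.0])
print(sense.shape, cosine(query, sense))
```

Normalizing each block separately keeps the smaller static block from being drowned out by differences in vector magnitude between the embedding sources.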
5 Experiments
Our experiments centered on evaluating our solution on Raganato et al. (2017a)'s set of cross-domain WSD tasks. In this section we compare our results to the current state-of-the-art, and provide results for our solution when disambiguating against the full set of possible senses in WordNet, revealing shortcomings to be improved.
5.1 All-Words Disambiguation
In Table 3 we show our results for all tasks of Raganato et al. (2017a)'s evaluation framework. We used the framework's scoring scripts to avoid any discrepancies in the scoring methodology. Note that the k-NN referred to in Table 3 always means the single closest neighbor, and relies on MFS fallbacks.
The first noteworthy result we obtained was that simply replicating Peters et al. (2018)'s method for WSD using BERT instead of ELMo, we were able to significantly, and consistently, surpass the performance of all previous works. When using our method (LMMS), performance still improves significantly over the previous impressive results (+1.9 F1 on ALL, +3.4 F1 on SemEval 2013). Interestingly, we found that our method using ELMo embeddings didn't outperform ELMo k-NN with MFS fallback, suggesting that it's necessary to achieve a minimum competence level of embeddings from sense annotations (and glosses) before the inferred sense embeddings become more useful than MFS.
In Figure 2 we show results when considering additional neighbors as valid predictions, together with a random baseline considering that some target words may have fewer senses than the number of accepted neighbors (always correct).
5.2 Part-of-Speech Mismatches
The solution we introduced in §4.4 addressed missing lemmas, but we didn't propose a solution that addressed missing POS information. Indeed, the confusion matrix in Table 4 shows that a large number of target words corresponding to verbs are wrongly assigned senses that correspond to adjectives or nouns. We believe this result can help motivate the design of new NLM tasks that are more capable of distinguishing between verbs and non-verbs.
5.3 Uninformed Sense Matching
WSD tasks are usually accompanied by auxiliary parts-of-speech (POSs) and lemma features for restricting the number of possible senses to those that are specific to a given lemma and POS. Even if those features aren't provided (e.g. real-world applications), it's sensible to use lemmatizers or POS taggers to extract them for use in WSD. However, as is the case with using MFS fallbacks, this filtering step obscures the true impact of NLM representations on k-NN solutions.
Consequently, we introduce a variation on WSD, called Uninformed Sense Matching (USM), where disambiguation is always performed against the full set of sense embeddings (i.e. +200K vs. a maximum of 59). This change makes the task much harder (results on Table 2), but offers some insights into NLMs, which we cover briefly in §5.4.
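The difference between filtered matching and Uninformed Sense Matching can be illustrated with a small 1-NN sketch over cosine similarity. The function `nearest_sense`, the sense labels, and the toy vectors are hypothetical; passing `candidates=None` stands in for matching against the full inventory.

```python
import numpy as np

def nearest_sense(query, sense_matrix, sense_keys, candidates=None):
    """Return the sense key with the highest cosine similarity to query.

    candidates: optional set of admissible sense keys (lemma/POS filter).
    None means every sense in the inventory is considered (uninformed matching).
    """
    query = np.asarray(query, dtype=np.float64)
    sense_matrix = np.asarray(sense_matrix, dtype=np.float64)
    q = query / np.linalg.norm(query)
    m = sense_matrix / np.linalg.norm(sense_matrix, axis=1, keepdims=True)
    sims = m @ q
    if candidates is not None:
        mask = np.array([key in candidates for key in sense_keys])
        sims = np.where(mask, sims, -np.inf)   # exclude inadmissible senses
    return sense_keys[int(np.argmax(sims))]

# Toy inventory with three hypothetical senses.
keys = ["play_music", "play_sport", "act_role"]
matrix = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0],
          [0.0, 0.0, 1.0]]
query = [0.1, 0.5, 0.8]

uninformed = nearest_sense(query, matrix, keys)
filtered = nearest_sense(query, matrix, keys, candidates={"play_music", "play_sport"})
print(uninformed, filtered)
```

In this toy case the uninformed match lands on a sense outside the target lemma's candidates, while the filtered match is forced back to an admissible sense, which is the effect the filtering step has on reported scores.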
5.4 Use of World Knowledge
It's well known that WSD relies on various types of knowledge, including commonsense and selectional preferences Lenat et al. (1986); Resnik (1997), for example. Using our sense embeddings for Uninformed Sense Matching allows us to glimpse into how NLMs may be interpreting contextual information with regards to the knowledge represented in WordNet. In Table 5 we show a few examples of senses matched at the token-level, suggesting that entities were topically understood and this information was useful to disambiguate verbs. These results would be less conclusive without full-coverage of WordNet.
References
- Arora et al. (2017) Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2017. A simple but tough-to-beat baseline for sentence embeddings. In International Conference on Learning Representations (ICLR).
- Basile et al. (2014) Pierpaolo Basile, Annalina Caputo, and Giovanni Semeraro. 2014. An enhanced Lesk word sense disambiguation algorithm through a distributional semantic model. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 1591–1600, Dublin, Ireland. Dublin City University and Association for Computational Linguistics.
- Bojanowski et al. (2017) Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.
- Bolukbasi et al. (2016) Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, and Adam Kalai. 2016. Man is to computer programmer as woman is to homemaker? debiasing word embeddings. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'16, pages 4356–4364, USA. Curran Associates Inc.
- Caliskan et al. (2017) Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334):183–186.
- Camacho-Collados and Pilehvar (2018) Jose Camacho-Collados and Mohammad Taher Pilehvar. 2018. From word to sense embeddings: A survey on vector representations of meaning. J. Artif. Int. Res., 63(1):743–788.
- Camacho-Collados et al. (2016) Jose Camacho-Collados, Mohammad Taher Pilehvar, and Roberto Navigli. 2016. Nasari: Integrating explicit knowledge and corpus statistics for a multilingual representation of concepts and entities. Artificial Intelligence, 240:36–64.
- Chen et al. (2014) Xinxiong Chen, Zhiyuan Liu, and Maosong Sun. 2014. A unified model for word sense representation and disambiguation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1025–1035, Doha, Qatar. Association for Computational Linguistics.
