TL;DR
This paper investigates methods to improve word sense accuracy in limited-topic corpora, finding that corpus augmentation often outperforms traditional embedding adaptation techniques, which may irreversibly lose sense information.
Contribution
It introduces a new regularizer based on cooccurrence stability and proposes using topic-rich source corpora for augmentation instead of embedding adaptation.
Findings
Regularization based on cooccurrence stability improves sense accuracy.
Corpus augmentation outperforms embedding adaptation in limited data settings.
Pretrained embeddings may irreversibly lose non-dominant sense information.
| Method | Physics | Gaming | Android | Unix |
|---|---|---|---|---|
| Tgt | 121.9 | 185.0 | 142.7 | 159.5 |
| Tgt(unpinned) | -0.6 | -0.8 | 0.2 | 0.1 |
| Method | Physics | Gaming | Android | Unix | Med |
|---|---|---|---|---|---|
| Tgt | 121.9 | 185.0 | 142.7 | 159.5 | 158.9 |
| SrcTune | | | | | |
| RegFreq | | | | | |
| RegSense | | | | | |
| SrcSel | | | | | |
| SrcSel+RegSense | | | | | |
| Method | Physics | Gaming | Android | Unix |
|---|---|---|---|---|
| Tgt | 86.7 | 82.6 | 86.8 | 85.4 |
| Src | -2.3 | 0.8 | -3.7 | -7.1 |
| Concat | -1.1 | 1.4 | -2.1 | -4.5 |
| AAEME | 1.2 | 4.6 | -0.3 | 0.0 |
| SrcTune | -0.3 | 1.9 | 0.6 | -0.0 |
| RegFreq | -0.4 | 2.4 | -0.5 | -0.5 |
| RegSense | -0.4 | 2.2 | -0.5 | -0.5 |
| SrcSel | 3.6 | 3.0 | 0.8 | 2.1 |
| SrcSel+RegSense | 3.6 | 3.1 | 0.8 | 2.1 |
| Method | Ohsumed Micro | Ohsumed Macro | Ohsumed Rare | 20NG Avg (5 topics) |
|---|---|---|---|---|
| Tgt | 26.3 | 14.7 | 3.0 | 88.9 |
| Src | -1.0 | 0. | 0. | -3.9 |
| AAEME | -1.0 | 0. | 0. | -3.9 |
| SrcTune | 1.7 | 1.8 | 1.5 | 0.0 |
| RegFreq | 0.6 | 1.8 | 3.7 | - |
| RegSense | 1.4 | 2.5 | 4.0 | 0.4 |
| SrcSel | 2.0 | 2.6 | 1.1 | 0.5 |
| SrcSel+RegSense | 2.3 | 3.4 | 4.3 | 0.5 |
| Variant | Physics | Gaming | Android | Unix |
|---|---|---|---|---|
| RegFreq's reduction in perplexity over Tgt | | | | |
| Original | 1.1 | 1.5 | 0.9 | 0.7 |
| +SrcInit | 2.1 | 5.7 | 1.1 | 2.1 |
| RegFreq's gain in AUC over Tgt | | | | |
| Original | -1.2 | 0.1 | -0.2 | -0.4 |
| +SrcInit | -0.4 | 2.4 | -0.5 | -0.5 |
| Pair | Tgt | SrcTune | RegFreq | RegSense | SrcSel |
|---|---|---|---|---|---|
| Unix topic | | | | | |
| nice, kill | 4.6 | 4.5 | 4.4 | 4.4 | 5.2 |
| vim, emacs | 5.7 | 5.8 | 5.7 | 5.8 | 6.4 |
| print, cat | 5.0 | 4.9 | 4.9 | 5.0 | 5.4 |
| kill, job | 5.2 | 5.1 | 5.2 | 5.3 | 5.8 |
| make, install | 5.1 | 5.1 | 5.3 | 5.7 | 5.8 |
| character, unicode | 4.9 | 5.1 | 4.7 | 4.6 | 5.8 |
| Physics topic | | | | | |
| lie, group | 5.2 | 5.0 | 4.4 | 5.1 | 5.8 |
| current, electron | 5.3 | 5.3 | 4.7 | 5.3 | 5.7 |
| potential, kinetic | 5.8 | 5.8 | 4.5 | 5.9 | 6.1 |
| rotated, spinning | 5.0 | 5.7 | 6.0 | 5.1 | 5.6 |
| x-ray, x-rays | 5.3 | 7.0 | 6.1 | 5.5 | 6.4 |
| require, cost | 4.9 | 6.2 | 5.2 | 5.1 | 5.3 |
| cool, cooling | 5.6 | 6.0 | 6.4 | 5.7 | 5.7 |
| Method | Physic | Game | Andrd | Unix | Med(Rare) |
|---|---|---|---|---|---|
| Tgt | 89.7 | 88.4 | 89.4 | 89.2 | 9.4 |
| SrcTune | 0.6 | ||||
| SrcSel | 0.5 | 0.0 | 1.1 |
| Method | Physics | Gaming | Android | Unix |
|---|---|---|---|---|
| BERT | 87.5 | 85.3 | 87.4 | 82.7 |
| SrcTune | 88.0 | 89.2 | 88.5 | 83.5 |
| SrcSel:R | 87.9 | 88.4 | 88.6 | 85.1 |
| Topic | Tokens | Vocab size | # duplicates |
|---|---|---|---|
| Physics | 542K | 6,026 | 1981 |
| Gaming | 302K | 6,748 | 3386 |
| Android | 235K | 4,004 | 3190 |
| Unix | 262K | 6,358 | 5312 |
| Method | Physic | Game | Andrd | Unix | Med |
|---|---|---|---|---|---|
| Tgt | 86.7 | 82.6 | 86.8 | 85.4 | 26.3 |
| ELMo | -1.0 | 4.5 | -1.5 | -2.3 | 3.2 |
| +Tgt | -0.8 | 3.8 | 0.5 | -0.0 | 4.1 |
| +ST | -0.5 | 3.0 | 0.3 | 0.2 | 3.5 |
| +SrcSel | 2.6 | 4.1 | 1.1 | 1.5 | 4.6 |
| Method | Micro Accuracy | Macro Accuracy |
|---|---|---|
| Tgt | 26.3±0.5 | 14.7±1.2 |
| SrcSel:R | 27.3±0.3 | 16.1±1.6 |
| SrcSel | 28.3±0.4 | 17.3±0.7 |
| LM Perplexity | Physics | Gaming | Android | Unix |
|---|---|---|---|---|
| Tgt | 121.9±0.6 | 185.0±0.3 | 142.7±2.7 | 159.5±1.2 |
| SSR | 114.8±0.2 | 172.7±1.5 | 131.6±0.7 | 151.8±1.1 |
| SrcSel | 116.1±0.9 | 173.3±0.6 | 136.7±1.1 | 153.1±0.1 |
| Question Dedup: AUC | Physics | Gaming | Android | Unix |
|---|---|---|---|---|
| Tgt | 86.7±0.4 | 82.6±0.4 | 86.8±0.5 | 85.3±0.3 |
| SrcSel:R | 89.2±0.2 | 85.6±0.4 | 87.5±0.3 | 86.8±0.2 |
| SrcSel:c | 88.7±0.3 | 84.8±0.3 | 87.0±0.5 | 85.8±0.3 |
| SrcSel | 90.4±0.2 | 85.4±0.5 | 87.4±0.4 | 87.5±0.1 |
| Method | Sci | Com | Pol | Rel | Rec |
|---|---|---|---|---|---|
| Tgt | 92.2 | 79.9 | 94.8 | 87.3 | 90.3 |
| Src | -0.1 | -9.1 | -3.3 | -1.0 | -6.0 |
| ST | 0.0 | 0.0 | -0.1 | 0.1 | 0.2 |
| RS | 0.9 | -0.2 | 0.2 | 1.2 | 0.1 |
| SrcSel | 1.2 | 0.1 | 0.5 | 0.5 | 0.3 |
| Method | Perplexity: Physics | Perplexity: Gaming | Perplexity: Android | Perplexity: Unix | AUC: Physics | AUC: Gaming | AUC: Android | AUC: Unix |
|---|---|---|---|---|---|---|---|---|
| Tgt | 121.9 | 185.0 | 142.7 | 159.5 | 86.7 | 82.6 | 86.8 | 85.3 |
| RegFreq | 2.1 | 7.0 | 1.8 | 3.4 | -0.4 | 2.3 | -0.6 | -0.3 |
| RegFreq-rinit | -1.6 | 1.2 | 1.6 | 2.6 | -1.2 | 0. | -0.2 | -0.3 |
| RegSense | 5.0 | 13.8 | 6.7 | 9.7 | -0.3 | 2.1 | -0.6 | -0.3 |
| RegSense-rinit | 3.6 | 11.1 | 7.0 | 8.9 | 0.7 | 1.2 | -0.3 | -0.2 |
| SrcSel | 5.8 | 11.7 | 6.0 | 6.3 | 3.7 | 2.8 | 0.6 | 2.2 |
| SrcSel:R-rinit | 5.8 | 12.5 | 10.4 | 7.9 | 2.5 | 2. | 0.4 | 1.5 |
Topic Sensitive Attention on Generic Corpora Corrects Sense Bias in Pretrained Embeddings
Vihari Piratla
IIT Bombay
Sunita Sarawagi
IIT Bombay
Soumen Chakrabarti
IIT Bombay
Abstract
Given a small corpus $\mathcal{D}_T$ pertaining to a limited set of focused topics, our goal is to train embeddings that accurately capture the sense of words in the topic in spite of the limited size of $\mathcal{D}_T$. These embeddings may be used in various tasks involving $\mathcal{D}_T$. A popular strategy in limited data settings is to adapt pretrained embeddings trained on a large corpus. To correct for sense drift, fine-tuning, regularization, projection, and pivoting have been proposed recently. Among these, regularization informed by a word's corpus frequency performed well, but we improve upon it using a new regularizer based on the stability of its cooccurrence with other words. However, a thorough comparison across ten topics, spanning three tasks, with standardized settings of hyper-parameters, reveals that even the best embedding adaptation strategies provide small gains beyond well-tuned baselines, which many earlier comparisons ignored. In a bold departure from adapting pretrained embeddings, we propose using $\mathcal{D}_T$ to probe, attend to, and borrow fragments from any large, topic-rich source corpus (such as Wikipedia), which need not be the corpus used to pretrain embeddings. This step is made scalable and practical by suitable indexing. We reach the surprising conclusion that even limited corpus augmentation is more useful than adapting embeddings, which suggests that non-dominant sense information may be irrevocably obliterated from pretrained embeddings and cannot be salvaged by adaptation. All our code and data splits will be made publicly available at https://github.com/vihari/focussed_embs.
1 Introduction
Word embeddings (Mikolov et al., 2013; Pennington et al., 2014) benefit many natural language processing (NLP) tasks. Often, a group of tasks may involve a limited corpus $\mathcal{D}_T$ pertaining to a few focused topics, e.g., discussion boards on Physics, video games, or Unix, or a forum for discussing medical literature. Because $\mathcal{D}_T$ may be too small to train word embeddings to sufficient quality, a prevalent practice is to harness general-purpose embeddings $\mathcal{E}$ pretrained on a broad-coverage corpus, not tailored to the topics of interest. The pretrained embeddings are sometimes used as-is ("pinned"). Even if $\mathcal{E}$ is trained on a "universal" corpus, considerable sense shift may exist in the meaning of polysemous words and their cooccurrences and similarities with other words. In a corpus about Unix, "cat" and "print" are more similar than in Wikipedia. "Charge" and "potential" are more related in a Physics corpus than in Wikipedia. Thus, pinning can lead to poor target task performance in case of serious sense mismatch. Another popular practice is to initialize the target embeddings to the pretrained vectors, but then "fine-tune" using $\mathcal{D}_T$ to improve performance in the target (Mou et al., 2015; Min et al., 2017; Howard and Ruder, 2018). As we shall see, the number of epochs of fine-tuning is a sensitive knob: excessive fine-tuning might lead to "catastrophic forgetting" (Kirkpatrick et al., 2017) of useful word similarities in $\mathcal{E}$, and too little fine-tuning may not adapt to the target sense.
Even if we are given development ("dev") sets for target tasks, the best balancing act between a pretrained $\mathcal{E}$ and a topic-focused $\mathcal{D}_T$ is far from clear. Should we fine-tune (all word vectors) in epochs and stop when dev performance deteriorates? Or should we keep some words close to their pretrained embeddings (a form of regularization) and allow others to tune more aggressively? On what properties of $\mathcal{D}_T$ and $\mathcal{E}$ should the regularization strength of each word depend? Our first contribution is a new measure of semantic drift of a word from $\mathcal{E}$ to $\mathcal{D}_T$, which can be used to control the regularization strength. In terms of perplexity, we show that this is superior to both epoch-based tuning and regularization based on simple corpus frequencies of words (Yang et al., 2017). Yet another option is to learn projections to align generic embeddings to the target sense (Bollegala et al., 2015; Barnes et al., 2018; K Sarma et al., 2018), or to a shared common space (Yin and Schütze, 2016; Coates and Bollegala, 2018; Bollegala and Bao, 2018). However, in carefully controlled experiments, none of the proposed approaches to adapting pretrained embeddings consistently beats the trivial baseline of discarding them and training afresh on $\mathcal{D}_T$!
Our second contribution is to explore other techniques beyond adapting generic embeddings $\mathcal{E}$. Often, we might additionally have easy access to a broad corpus $\mathcal{D}_S$ like Wikipedia. $\mathcal{D}_S$ may span many diverse topics, while $\mathcal{D}_T$ focuses on one or a few, so there may be large overall drift from $\mathcal{D}_S$ to $\mathcal{D}_T$ too. However, a judicious subset $\hat{\mathcal{D}}_S \subset \mathcal{D}_S$ may exist that would be excellent for augmenting $\mathcal{D}_T$. The large size of $\mathcal{D}_S$ is not a problem: we use an inverted index that we probe with documents from $\mathcal{D}_T$ to efficiently identify $\hat{\mathcal{D}}_S$. Then we apply a novel perplexity-based joint loss over $\hat{\mathcal{D}}_S \cup \mathcal{D}_T$ to fit adapted word embeddings. While most recent research has focused on designing better methods of adapting pretrained embeddings, we show that retraining with selected source text is significantly more accurate than the best embeddings-only strategy, while runtime overheads are within practical limits.
An important lesson is that non-dominant sense information may be irrevocably obliterated from generic embeddings; it may not be possible to salvage this information by post-facto adaptation.
Summarizing, our contributions are:
- We propose new formulations for training topic-specific embeddings on a limited target corpus $\mathcal{D}_T$ by (1) adapting generic pre-trained word embeddings $\mathcal{E}$, and/or (2) selecting from any available broad-coverage corpus $\mathcal{D}_S$.
- We perform a systematic comparison of our methods and several recent methods on three tasks spanning ten topics and offer many insights.
- Our selection of $\hat{\mathcal{D}}_S$ from $\mathcal{D}_S$ and joint perplexity minimization on $\hat{\mathcal{D}}_S \cup \mathcal{D}_T$ perform better than pure embedding adaptation methods, at the (practical) cost of processing $\mathcal{D}_S$.
- We evaluate our method even with contextual embeddings. The relative performance of the adaptation alternatives remains fairly stable whether the adapted embeddings are used on their own, or concatenated with context-sensitive embeddings (Peters et al., 2018; Cer et al., 2018).
2 Related work and baselines
CBOW
We review the popular CBOW model for learning unsupervised word representations (Mikolov et al., 2013). As we scan the corpus, we collect a focus word $w$ and a set $C$ of context words around it, with corresponding embedding vectors $u_w$ and $v_c$, where $c \in C$. The two embedding matrices $U, V$ are estimated as:
[Equation (1)]
Here $v_C$ is the average of the context vectors in $C$. $\bar{w}$ is a negative focus word sampled from a slightly distorted unigram distribution of the corpus. Usually downstream applications use only the embedding matrix $U$, with each word vector scaled to unit length. Apart from CBOW, Mikolov et al. (2013) defined the related skipgram model, and Pennington et al. (2014) proposed the GloVe model, which can also be used in our framework. We found CBOW to work better for our downstream tasks.
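As a rough illustration of the objective described above, the following sketch computes the loss contribution of one (focus word, context) sample in the standard negative-sampling form of CBOW. The use of log-sigmoid terms follows the usual word2vec formulation and is an assumption here, as are the function names; this is not the authors' training code.

```python
import math


def sigmoid(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise average of the context vectors (v_C in the text)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cbow_sample_loss(u_focus, context_vectors, negative_focus_vectors):
    """Negative log-likelihood of one sample under negative sampling.

    u_focus: focus embedding of the observed word w.
    context_vectors: context embeddings of the words in C.
    negative_focus_vectors: focus embeddings of sampled negative words.
    """
    v_c = mean_vector(context_vectors)
    # Reward a high score for the observed focus word ...
    loss = -math.log(sigmoid(dot(u_focus, v_c)))
    # ... and a low score for each sampled negative focus word.
    for u_neg in negative_focus_vectors:
        loss -= math.log(sigmoid(-dot(u_neg, v_c)))
    return loss
```

Minimizing this quantity over all samples corresponds to maximizing the usual CBOW objective; a focus vector aligned with the averaged context yields a lower loss than one pointing away from it.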
Src, Tgt and Concat baselines
In the "Src" option, pre-trained embeddings $\mathcal{E}$ trained only on a large corpus are used as-is. The other extreme, called "Tgt", is to train word embeddings from scratch on the limited target corpus $\mathcal{D}_T$. In our experiments we found that Src performs much worse than Tgt, indicating the presence of significant drift in prominent word senses. Two other simple baselines are "Concat", which concatenates the source and target trained embeddings and lets the downstream task figure out their relative roles, and "Avg", which, following Coates and Bollegala (2018), takes their simple average. Another option is to let the downstream task learn to combine multiple embeddings as in Zhang et al. (2016).
As word embeddings have gained popularity for representing text in learning models, several methods have been proposed for enriching small datasets with pre-trained embeddings.
Adapting pre-trained embeddings
SrcTune:
A popular method (Min et al., 2017; Wang et al., 2017; Howard and Ruder, 2018) is to use the source embeddings $\mathcal{E}$ to initialize and thereafter train on $\mathcal{D}_T$. We call this "SrcTune". Fine-tuning requires careful control of the number of epochs with which we train on $\mathcal{D}_T$. Excessive training can wipe out any benefit of the source because of catastrophic forgetting. Insufficient training may not incorporate target corpus senses in case of polysemous words, and adversely affect target tasks (Mou et al., 2015). The number of epochs can be controlled using perplexity on a held-out portion of $\mathcal{D}_T$, or using downstream tasks. Howard and Ruder (2018) propose to fine-tune a whole language model using careful differential learning rates. However, epoch-based termination may be inadequate. Different words may need diverse trade-offs between the source and target topics, which we discuss next.
RegFreq (frequency-based regularization):
Yang et al. (2017) proposed to train word embeddings using $\mathcal{D}_T$, but with a regularizer to prevent a word $w$'s embedding from drifting too far from the source embedding ($u^S_w$). The weight of the regularizer is meant to be inversely proportional to the concept drift of $w$ across the two corpora. Their limitation was that corpus frequency was used as a surrogate for stability; high stability was awarded only to words frequent in both corpora. As a consequence, very few words in a focused $\mathcal{D}_T$ about Physics will benefit from a broad-coverage corpus like Wikipedia. Thousands of words like galactic, stars, motion, x-ray, and momentum will get low stability, although their prominent sense is the same in the two corpora. We propose a better regularization scheme in this paper. Unlike us, Yang et al. (2017) did not compare with fine-tuning.
Projection-based methods
These methods attempt to project embeddings of one kind to another, or to a shared common space. Bollegala et al. (2014) and Barnes et al. (2018) proposed to learn a linear transformation between the source and target embeddings. Yin and Schütze (2016) transform multiple embeddings to a common "meta-embedding" space. Simple averaging is also shown to be effective (Coates and Bollegala, 2018), and the recent auto-encoder-based meta-embedder (AEME) of Bollegala and Bao (2018) is the state of the art. K Sarma et al. (2018) proposed CCA to project both embeddings to a common sub-space. Some of these methods designate a subset of the overlapping words as pivots to bridge the target and source parameters in various ways (Blitzer et al., 2006; Ziser and Reichart, 2018; Bollegala et al., 2015). Many such techniques were proposed in a cross-domain setting, and specifically for the sentiment classification task. Gains are mainly from effective transfer of sentiment representation across domains. Our challenge arises when a corpus with broad topic coverage pretrains dominant word senses quite different from those needed by tasks associated with narrower topics.
Language models for task transfer
Complementary to the technique of adapting individual word embeddings is the design of deeper sequence models for task-to-task transfer. Cer et al. (2018) and Subramanian et al. (2018) propose multi-granular transfer of sentence and word representations across tasks using Universal Sentence Encoders. ELMo (Peters et al., 2018) trains a multi-layer sequence model to build a context-sensitive representation of words in a sentence. ULMFiT (Howard and Ruder, 2018) presents additional tricks such as gradual unfreezing of parameters layer-by-layer, and exponentially more aggressive fine-tuning toward output layers. Devlin et al. (2018) propose a deep bidirectional language model for generic contextual word embeddings. We show that our topic-sensitive embeddings provide additional benefit even when used with contextual embeddings.
3 Proposed approaches
We explore two families of methods: (1) those that have access to only pretrained embeddings $\mathcal{E}$ (Sec. 3.1), and (2) those that also have access to a source corpus $\mathcal{D}_S$ with broad topic coverage (Sec. 3.2).
3.1 RegSense: Stability-based regularization
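Our first contribution is a more robust definition of stability to replace the frequency-based regularizer of RegFreq. We first train word vectors on $\mathcal{D}_T$, and assume the pretrained embeddings $\mathcal{E}$ are available. Let the focus embeddings of word $w$ in $\mathcal{E}$ and $\mathcal{D}_T$ be $u^S_w$ and $u^T_w$. We overload $\mathcal{E} \cap \mathcal{D}_T$ as the set of words that occur in both. For each word $w \in \mathcal{E} \cap \mathcal{D}_T$, we compute $N^{(K)}_S(w)$, the $K$ nearest neighbors of $w$ with respect to the generic embeddings, i.e., the words $n$ with the largest values of $\cos(u^S_w, u^S_n)$ from $\mathcal{E} \cap \mathcal{D}_T$. Here $K$ is a suitable hyperparameter. Now we define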
[Equation (2)]
Intuitively, if we consider near neighbors $n$ of $w$ in terms of source embeddings, and most of these $n$'s have target embeddings very similar to the target embedding of $w$, then $w$ is stable across $\mathcal{E}$ and $\mathcal{D}_T$, i.e., has low semantic drift from $\mathcal{E}$ to $\mathcal{D}_T$.
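The intuition above can be sketched in a few lines. The code below picks the $K$ nearest neighbors of a word under the source embeddings and averages the target-space cosine similarity to those neighbors. Averaging is an assumption made for illustration (the text only says that "most" neighbors should remain similar), and the function and argument names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def stability(word, src_emb, tgt_emb, k=10):
    """Average target-space similarity of `word` to its source-space neighbors.

    src_emb, tgt_emb: dicts mapping word -> vector (source and target
    focus embeddings). Only words present in both are considered.
    """
    shared = [w for w in src_emb if w in tgt_emb and w != word]
    # K nearest neighbors of `word` according to the source embeddings.
    neighbors = sorted(
        shared,
        key=lambda n: cosine(src_emb[word], src_emb[n]),
        reverse=True,
    )[:k]
    if not neighbors:
        return 0.0
    # How similar are those same neighbors in the target embedding space?
    total = sum(cosine(tgt_emb[word], tgt_emb[n]) for n in neighbors)
    return total / len(neighbors)
```

A word whose source-space neighbors remain close in the target space scores near 1; a word whose neighborhood has moved away scores near 0 or below.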
While many other forms of stability can achieve the same ends, ours seems to be the first formulation that goes beyond mere word frequency and employs the topological stability of near-neighbors in the embedding space. Here is why this is important. Going from a generic corpus like Wikipedia to the very topic-focused StackExchange (Physics) corpus $\mathcal{D}_T$, the words x-ray, universe, kilometers, nucleons, absorbs, emits, sqrt, anode, diodes, and km/h have large stability per our definition above, but low stability according to Yang et al.'s frequency method since they are (relatively) rare in the source. Using their method, therefore, these words will not benefit from reliable pretrained embeddings.
Finally, the word regularization weight is:
[Equation (3)]
Here $\lambda$ is a hyperparameter. $R(w)$ above is a replacement for the regularizer used by Yang et al. (2017). If $R(w)$ is large, $w$ is regularized more heavily toward its source embedding, keeping $u_w$ closer to $u^S_w$. The modified CBOW loss is:
[Equation (4)]
Our $R(w)$ performs better than Yang et al.'s.
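To make the role of the regularization weight concrete, here is a minimal sketch of a per-word penalty that pulls an embedding toward its source vector in proportion to the word's stability. The squashing function used here (a clipped tanh of the scaled stability) and the squared-distance penalty are assumptions chosen only for illustration; the names `regularization_weight` and `drift_penalty` are hypothetical.

```python
import math


def regularization_weight(stability, lam=1.0):
    """Map a stability score to a non-negative weight.

    ASSUMPTION: clipped tanh of lam * stability, used here only as one
    bounded, monotone choice; lam plays the role of the hyperparameter.
    """
    return max(0.0, math.tanh(lam * stability))


def drift_penalty(u_w, u_w_source, stability, lam=1.0):
    """Penalty added to the CBOW loss for one word.

    Larger stability means a larger weight, so the word is held closer
    to its source embedding u_w_source.
    """
    squared_distance = sum((a - b) ** 2 for a, b in zip(u_w, u_w_source))
    return regularization_weight(stability, lam) * squared_distance
```

With this shape, unstable words (stability at or below zero) incur no penalty and are free to move toward their target sense, while stable words pay for every step away from the pretrained vector.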
3.2 Source selection and joint perplexity
To appreciate the limitations of regularization, consider words like potential, charge, law, field, matter, medium, etc. These will get small stability ($R(w)$) values because their dominant senses in a universal corpus do not match those in a Physics corpus ($\mathcal{D}_T$), but $\mathcal{D}_T$ may be too limited to wipe out that dominant sense for a subset of words while preserving the meaning of stable words. However, there are plenty of high-quality broad-coverage sources like Wikipedia that include plenty of Physics documents that could gainfully supplement $\mathcal{D}_T$. Therefore, we seek to include target-relevant documents from a generic source corpus $\mathcal{D}_S$, even if the dominant sense of a word in $\mathcal{D}_S$ does not match that in $\mathcal{D}_T$. The goal is to do this without solving the harder problem of unsupervised, expensive and imperfect sense discovery in $\mathcal{D}_S$ and sense tagging of $\mathcal{D}_T$, and using per-sense embeddings.
The main steps of the proposed approach, SrcSel, are shown in Figure 1. Before describing the steps in detail, we note that preparing and probing a standard inverted index (Baeza-Yates and Ribeiro-Neto, 1999) are extremely fast, owing to decades of performance optimization. Also, index preparation can be amortized over multiple target tasks. (The granularity of a "document" can be adjusted to the application.)
Selecting source documents to retain:
Let $s \in \mathcal{D}_S$ and $t \in \mathcal{D}_T$ be source and target documents. Let $\text{sim}(s, t)$ be the similarity between them, in terms of the TFIDF cosine score commonly used in Information Retrieval (Baeza-Yates and Ribeiro-Neto, 1999). The total vote of $\mathcal{D}_T$ for $s$ is then $\sum_{t \in \mathcal{D}_T} \text{sim}(s, t)$. We choose a suitable cutoff on this aggregate score, to reduce $\mathcal{D}_S$ to $\hat{\mathcal{D}}_S$, as follows. Intuitively, if we hold out a randomly sampled part of $\mathcal{D}_T$, our cutoff should let through a large fraction (we used 90%) of the held-out part. Once we find such a cutoff, we apply it to $\mathcal{D}_S$ and retain the source documents whose aggregate scores exceed the cutoff. Beyond mere selection, we design a joint perplexity objective over $\hat{\mathcal{D}}_S \cup \mathcal{D}_T$, with a term for the amount of trust we place in a retained source document. This limits damage from less relevant source documents that slipped through the text retrieval filter. Since the retained documents are weighted based on their relevance to the topical target corpus $\mathcal{D}_T$, we found it beneficial to also include a percentage (we used 10%) of randomly selected documents from $\mathcal{D}_S$. We refer to the method that uses only documents retained by the text retrieval filter as SrcSel:R, and the method that uses only randomly selected documents from $\mathcal{D}_S$ as SrcSel:c. SrcSel uses documents from both the retrieval filter and random selection.
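The selection step can be sketched as follows. This toy version scores every source document against every target document by brute force, whereas the text relies on an inverted index for scale; the TF-IDF weighting variant, the tie handling at the cutoff, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one sparse TF-IDF dict per document."""
    n = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed IDF (an assumed variant; any standard TF-IDF would do).
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in doc_freq.items()}
    return [{t: f * idf[t] for t, f in Counter(doc).items()} for doc in docs]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_source_documents(source_docs, target_docs, heldout_docs,
                            pass_fraction=0.9):
    """Return (indices of retained source documents, cutoff).

    The cutoff is the largest vote threshold that still lets at least
    `pass_fraction` of the held-out target documents through.
    Assumes heldout_docs is non-empty.
    """
    vecs = tfidf_vectors(source_docs + target_docs + heldout_docs)
    n_src, n_tgt = len(source_docs), len(target_docs)
    src = vecs[:n_src]
    tgt = vecs[n_src:n_src + n_tgt]
    held = vecs[n_src + n_tgt:]

    def vote(vec):
        # Total vote of the target corpus for one document.
        return sum(cosine(vec, t) for t in tgt)

    held_votes = sorted(vote(h) for h in held)
    index = int(math.floor((1.0 - pass_fraction) * len(held_votes)))
    cutoff = held_votes[index]
    selected = [i for i, v in enumerate(src) if vote(v) >= cutoff]
    return selected, cutoff
```

For example, with a Physics-like target corpus, a source document about electrons and fields clears the cutoff while a document about football does not.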
Joint perplexity objective:
Similar to Eqn. (1), we will sample word $w$ and context $C$ from $\mathcal{D}_T$ and $\hat{\mathcal{D}}_S$. Given our limited trust in $\hat{\mathcal{D}}_S$, we will give each sample from $\hat{\mathcal{D}}_S$ an alignment score $Q(w, C)$. This should be large when $w$ is used in a context similar to contexts in $\mathcal{D}_T$. We judge this based on the target embedding $u^T_w$:
[Equation (5)]
Since $u^T_w$ represents the sense of the word in the target, source contexts $C$ which are similar will get a high score. Similarity in source embeddings is not used here because our intent is to preserve the target senses. We tried other forms such as dot-product or its exponential and chose the above form because it is bounded and hence less sensitive to gross noise in inputs.
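One bounded score with the properties described above is a cosine similarity clipped at zero, computed between the target focus embedding of the word and the averaged target context embeddings of the source context. The sketch below uses that form as an assumption; it is one plausible instance of a bounded alignment score, not necessarily the exact expression used by the authors, and the names are hypothetical.

```python
import math


def alignment_score(u_target_focus, target_context_vectors):
    """Bounded alignment score in [0, 1] for one source sample (w, C).

    u_target_focus: target-corpus focus embedding of w.
    target_context_vectors: target-corpus context embeddings of the
        words in the source context C.
    ASSUMPTION: clipped cosine between the focus vector and the averaged
    context vector.
    """
    n = len(target_context_vectors)
    dim = len(u_target_focus)
    v_c = [sum(v[i] for v in target_context_vectors) / n for i in range(dim)]
    dot = sum(a * b for a, b in zip(u_target_focus, v_c))
    norm = (math.sqrt(sum(a * a for a in u_target_focus))
            * math.sqrt(sum(b * b for b in v_c)))
    if norm == 0.0:
        return 0.0
    # Clipping keeps the score non-negative; cosine keeps it at most 1.
    return max(0.0, dot / norm)
```

Because the score never exceeds 1, a single noisy source context cannot dominate the objective, unlike an unbounded dot product or its exponential.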
The word2vec objective (1) is enhanced to
[Equation (6)]
The first sum is the regular word2vec loss over $\mathcal{D}_T$. Word $\bar{w}$ is sampled from the vocabulary of $\mathcal{D}_T$ as usual, according to a suitable distribution. The second sum is over the retained source documents $\hat{\mathcal{D}}_S$. Note that $Q(w, C)$ is computed using the pre-trained target embeddings and does not change during the course of training.
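The structure of the two sums can be sketched as below, assuming the per-sample word2vec losses and the fixed alignment scores have already been computed. The function name and the convention that losses are minimized are illustrative assumptions.

```python
def joint_loss(target_sample_losses, source_sample_losses, source_scores):
    """Combine target and retained-source samples into one training loss.

    target_sample_losses: per-sample word2vec losses over the target corpus.
    source_sample_losses: per-sample losses over retained source documents.
    source_scores: fixed alignment score for each source sample, computed
        once from the pre-trained target embeddings.
    """
    if len(source_sample_losses) != len(source_scores):
        raise ValueError("each source sample needs exactly one score")
    # Target samples are fully trusted.
    total = sum(target_sample_losses)
    # Source samples count only as much as they align with target usage.
    total += sum(q * loss for q, loss in zip(source_scores, source_sample_losses))
    return total
```

A source sample with score 0 contributes nothing, so source contexts that reflect a different sense of the word are effectively ignored.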
SrcSel+RegSense combo:
Here we combine objective (6) with the regularization term in (4), where $R(w)$ uses all of $\mathcal{E}$ as in RegSense.
4 Experiments
We compare the methods discussed thus far, with the goal of answering these research questions:
1. Can word-based regularization (RegFreq and RegSense) beat careful termination at epoch granularity, after initializing with source embeddings (SrcTune)?
2. How do these compare with just fusing Src and Tgt via recent meta-embedding methods like AAEME (Bollegala and Bao, 2018)? (We used the implementation available at https://github.com/CongBao/AutoencodedMetaEmbedding.)
3. Does SrcSel provide sufficient and consistent gains over RegSense to justify the extra effort of processing a source corpus?
4. Do contextual embeddings obviate the need for adapting word embeddings?
We also establish that initializing with source embeddings improves regularization methods. (Curiously, RegFreq was never combined with source initialization.)
Topics and tasks
We compare across 15 topic-task pairs spanning 10 topics and 3 task types: an unsupervised language modeling task on five topics, a document classification task on six topics, and a duplicate question detection task on four topics. In our setting, $\mathcal{D}_T$ covers a small subset of topics in $\mathcal{D}_S$, which is the 20160901 dump of Wikipedia. (The target corpora in our experiments came from datasets that were created before this time.) Our tasks are different from GLUE-like multi-task learning (Wang et al., 2019), because our focus is on the problems created by the divergence between prominent sense-dominated generic word embeddings and their sense in narrow target topics. We do not experiment on the cross-domain sentiment classification task popular in domain adaptation papers, since such tasks benefit more from sharing sentiment-bearing words than from learning the correct sense of polysemous words, which is our focus here. All our experiments are on public datasets, and we will publicly release our experiment scripts and code.
StackExchange topics
We pick four topics (Physics, Gaming, Android and Unix) from the CQADupStack dataset (http://nlp.cis.unimelb.edu.au/resources/cqadupstack/) of questions and responses. For each topic, the available response text is divided into $\mathcal{D}_T$, used for training/adapting embeddings, and a held-out evaluation fold used to measure perplexity. In each topic, the target corpus has 2000 responses totalling roughly 1 MB. We also report results with changing sizes of $\mathcal{D}_T$. Depending on the method, we use $\mathcal{D}_T$, $\mathcal{E}$, or $\mathcal{D}_S$ to train topic-specific embeddings and evaluate them as-is on two tasks that train task-specific layers on top of these fixed embeddings. The first is an unsupervised language modeling task where we train an LSTM (https://github.com/tensorflow/models/blob/master/tutorials/rnn/ptb/ptb_word_lm.py) on the adapted embeddings (which are pinned) and report perplexity on the evaluation fold. The second is a duplicate question detection task. Available in each topic are human-annotated duplicate questions (statistics in Table 10 of the Appendix), which we partition across train, test and dev as 50%, 40%, 10%. For contrastive training, we add four times as many randomly chosen non-duplicate pairs. The goal is to predict duplicate/not for a question pair, for which we use word mover's distance (Kusner et al., 2015, WMD) over adapted word embeddings. We found WMD more accurate than BiMPM (Wang et al., 2017). We use three splits of the target corpus, and for each resultant embedding, measure AUC on three random (train-)dev-test splits of question pairs, for a total of nine runs. For reporting AUC, WMD does not need the train fold.
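As a small illustration of why the train fold is not needed to report AUC, the sketch below computes AUC directly from pairwise distances and duplicate labels: it is the fraction of (duplicate, non-duplicate) pairs in which the duplicate pair has the smaller distance. The distances would come from WMD over the adapted embeddings; here they are plain numbers, and the function name is hypothetical.

```python
def auc_from_distances(distances, labels):
    """AUC for duplicate detection when a smaller distance means 'duplicate'.

    distances: one distance per question pair (e.g. word mover's distance).
    labels: 1 for a duplicate pair, 0 for a non-duplicate pair.
    Ties between a duplicate and a non-duplicate count as half a win.
    """
    positives = [d for d, y in zip(distances, labels) if y == 1]
    negatives = [d for d, y in zip(distances, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one duplicate and one non-duplicate pair")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p < n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

The quadratic loop is fine for a sketch; a rank-based computation would be preferable for large evaluation sets.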
Medical domain:
This domain from the Ohsumed dataset (https://www.mat.unical.it/OlexSuite/Datasets/SampleDataSets-about.htm) has abstracts on cardiovascular diseases. We sample 1.4 MB of abstracts as target corpus $\mathcal{D}_T$. We evaluate embeddings on two tasks: (1) unsupervised language modeling on the remaining abstracts, and (2) supervised classification on 23 MeSH classes based on title. We randomly select 10,000 titles with train, test, dev split as 50%, 40%, and 10%. Following Joulin et al. (2017), we train a softmax layer on the average of adapted (and pinned) word embeddings.
Topics from 20 newsgroup
We choose the five top-level classes in the 20 newsgroup dataset (http://qwone.com/~jason/20Newsgroups/) as topics, viz., Computer, Recreation, Science, Politics, Religion. The corresponding five downstream tasks are text classification over the 3–5 fine-grained classes under each top-level class. Train, test, dev splits were 50%, 40%, 10%. We average over nine splits. The body text is used as $\mathcal{D}_T$ and subject text is used for classification.
Pretrained embeddings $\mathcal{E}$ are trained on Wikipedia using the default settings of word2vec's CBOW model.
4.1 Effect of fine-tuning embeddings on the target task
We chose to pin embeddings, once adapted to the target corpus, in all our experiments, namely the document classification task on medical and 20 newsgroup topics and the language model task on five different topics. This is because we did not see any improvements when we unpinned the input embeddings. We summarize in Table 1 the results when the embeddings are not pinned on the language model task on the four StackExchange topics.
4.2 Epochs vs. regularization results
In Figure 2 we show perplexity and AUC against training epochs. Here we focus on four methods: Tgt, SrcTune, RegFreq, and RegSense. First note that Tgt continues to improve on both perplexity and AUC metrics beyond five epochs (the default in the word2vec code, https://code.google.com/archive/p/word2vec/, and left unchanged in the RegFreq code, https://github.com/Victor0118/cross_domain_embedding/; Yang et al., 2017). In contrast, SrcTune, RegSense, and RegFreq are much better than Tgt at five epochs, saturating quickly. With respect to perplexity, SrcTune starts getting worse around 20 iterations and becomes identical to Tgt, showing catastrophic forgetting. Regularizers in RegFreq and RegSense are able to reduce such forgetting, with RegSense being more effective than RegFreq. These experiments show that any comparison that chooses a fixed number of training epochs across all methods is likely to be unfair. Henceforth we will use a validation set for the stopping criteria. While this is standard practice for supervised tasks, most word embedding code we downloaded ran for a fixed number of epochs, making comparisons unreliable. We conclude that validation-based stopping is critical for fair evaluation.
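Validation-based stopping, as opposed to a fixed epoch count, can be sketched with a generic loop like the one below. The callbacks `train_one_epoch` and `validation_perplexity`, as well as the patience rule, are hypothetical stand-ins for whatever training and evaluation code is in use; the paper does not specify its exact stopping rule.

```python
def train_with_early_stopping(train_one_epoch, validation_perplexity,
                              max_epochs=100, patience=3):
    """Train until held-out perplexity stops improving.

    train_one_epoch: callable that runs one pass over the training corpus.
    validation_perplexity: callable returning perplexity on held-out text.
    patience: number of consecutive non-improving epochs tolerated.
    Returns (best_epoch, best_perplexity).
    """
    best_perplexity = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        perplexity = validation_perplexity()
        if perplexity < best_perplexity:
            best_perplexity = perplexity
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_epoch, best_perplexity
```

Applying the same rule to every method lets each one stop at its own best epoch, which avoids the bias of a shared fixed epoch count noted above. In practice one would also checkpoint the embeddings at the best epoch.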
We next compare SrcTune, RegFreq, and RegSense on the three tasks: perplexity in Table 2, duplicate detection in Table 3, and classification in Table 4. All three methods are better than the baselines Src and Concat, which are much worse than Tgt, indicating the presence of significant concept drift. Yang et al. (2017) provided no comparison between RegFreq (their method) and SrcTune; we find the latter slightly better. On the supervised tasks, RegFreq is often worse than Tgt provided Tgt is allowed to train for enough epochs. If the same number of epochs is used to train the two methods, one can reach the misleading conclusion that Tgt is worse. RegSense is better than SrcTune and RegFreq, particularly with respect to perplexity and rare class classification (Table 4). We conclude that a well-designed word stability-based regularizer can improve upon epoch-based fine-tuning.
Impact of source initialization
Table 5 compares Tgt and RegFreq with two initializers: (1) random as proposed by Yang et al. (2017), and (2) with source embeddings. RegFreq after source initialization is better in almost all cases. SrcSel and RegSense also improve with source initialization, but to a smaller extent. (More detailed numbers are in Table 18 of the Appendix.) We conclude that initializing with pretrained embeddings is helpful even with regularizers.
Comparison with Meta-embeddings
In Tables 3 and 4 we show results with the most recent meta-embedding method AAEME. AAEME provides gains over Tgt in only two out of six cases. (Footnote: On the topic classification datasets in Table 4, AAEME and its variant DAEME were worse than Src. We used the dev set to select the better of Src and their best method.)
4.3 Performance of SrcSel
We next focus on the performance of SrcSel on all three tasks: perplexity in Table 2, duplicate detection in Table 3, and classification in Table 4. SrcSel is always among the best two methods for perplexity. In supervised tasks, SrcSel is the only method that provides significant gains for all topics: AUC for duplicate detection increases by 2.4%, and classification accuracy increases by 1.4% on average. SrcSel+RegSense performs even better than SrcSel on all three tasks, particularly on rare words. An ablation study on other variants of SrcSel appears in the Appendix.
Word-pair similarity improvements:
In Table 6, we show the normalized cosine similarity of word pairs pertaining to the Physics and Unix topics. (Footnote: Normalized similarity is defined with respect to a set of 20 words sampled based on their frequency; this set is fixed across methods.) Observe how word pairs like (nice, kill), (vim, emacs) in Unix and (current, electron), (lie, group) in Physics are brought closer together as a result of importing the larger unix/physics subset from the source corpus. In each of these pairs, some words (e.g. nice, vim, lie, current) have a different prominent sense in the source (Wikipedia). Hence, methods like SrcTune and RegSense cannot help. In contrast, word pairs like (cost, require), (x-ray, x-rays), whose sense is the same in the two corpora, benefit significantly from the source across all methods.
Running time:
SrcSel is five times slower than RegFreq, which is still eminently practical. was within the size of in all domains. If is available, SrcSel is a practical and significantly more accurate option than adapting pretrained source embeddings. SrcSel+RegSense complements SrcSel on rare words, improves perplexity, and is never worse than SrcSel.
Effect of target corpus size
The problem of importing source embeddings is motivated only when target data is limited. When we increase the target corpus 6-fold, the gains of SrcSel and SrcTune over Tgt were insignificant in most cases. However, infrequent classes continued to benefit from the source, as shown in Table 7.
4.4 Contextual embeddings
We explore whether contextual word embeddings obviate the need for adapting source embeddings, using ELMo (Peters et al., 2018), a contextualized word representation model pre-trained on a 5.5B token corpus (https://allennlp.org/elmo). We compare ELMo's contextual embeddings as-is, and also after concatenating them with each of the Tgt, SrcTune, and SrcSel embeddings, in Table 8. First, ELMo+Tgt is better than Tgt and ELMo individually. This shows that contextual embeddings are useful but they do not eliminate the need for topic-sensitive embeddings. Second, ELMo+SrcSel is better than ELMo+Tgt. Although SrcSel is trained on data that is a strict subset of ELMo's training data, it is still instrumental in giving gains since that subset is aligned better with the target sense of words. We conclude that topic-adapted embeddings can be useful, even with ELMo-style contextual embeddings.
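The concatenation used in this comparison can be sketched as follows. This is a minimal illustration, assuming one contextual vector per token and a lookup table of static topic embeddings; the function name, the toy vectors, and the choice of a zero vector for out-of-vocabulary tokens are assumptions made for the example and are not details given in the paper.

```python
from typing import Dict, List, Sequence


def concat_features(
    tokens: Sequence[str],
    contextual_vectors: Sequence[Sequence[float]],
    static_table: Dict[str, Sequence[float]],
    static_dim: int,
) -> List[List[float]]:
    """Append a static topic embedding to each token's contextual vector.

    Tokens missing from `static_table` receive a zero vector
    (an assumption of this sketch, not a choice documented in the paper).
    """
    if len(tokens) != len(contextual_vectors):
        raise ValueError("need exactly one contextual vector per token")
    zero = [0.0] * static_dim
    features = []
    for token, ctx in zip(tokens, contextual_vectors):
        static = static_table.get(token, zero)
        if len(static) != static_dim:
            raise ValueError(f"static vector for {token!r} has wrong size")
        features.append(list(ctx) + list(static))
    return features


# Toy 3-dimensional "contextual" vectors and 2-dimensional "topic" vectors.
tokens = ["kill", "the", "process"]
contextual = [[0.1, 0.2, 0.3], [0.0, 0.0, 0.1], [0.4, 0.5, 0.6]]
topic_table = {"kill": [0.9, 0.1], "process": [0.8, 0.2]}
features = concat_features(tokens, contextual, topic_table, static_dim=2)
```

Each output row has the contextual dimensions followed by the static ones, so a downstream model sees both the context-dependent signal and the topic-adapted sense of the word.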
Recently, BERT (Devlin et al., 2018) has garnered a lot of interest for beating contemporary contextual embeddings on all the GLUE tasks. We evaluate BERT on the duplicate question detection task on the four StackExchange topics. We use pre-trained BERT-base, a smaller 12-layer transformer network, for our experiments. We train a classification layer on the final pooled representation of the sentence pair given by BERT to obtain the binary label of whether they are duplicates. This is unlike the earlier setup where we used EMD on the fixed embeddings.
To evaluate the utility of a relevant topic-focused corpus, we fine-tune the pre-trained checkpoint either on (SrcTune) or on (SrcSel:R) using BERT's masked language model loss. The classifier is then initialized with the fine-tuned checkpoint. Since fine-tuning is sensitive to the number of update steps, we tune the number of training steps using performance on a held-out dev set. F1 scores corresponding to different initializing checkpoints are shown in Table 9. It is clear that pre-training the contextual embeddings on a relevant target corpus helps in the downstream classification task. However, the gains of SrcSel:R over Tgt are not clear. This could be due to incomplete or noisy sentences in . More experimentation is needed to understand the limited gains of SrcSel:R over SrcTune in the case of BERT. We leave this for future work.
5 Conclusion
We introduced one regularization and one source-selection method for adapting word embeddings from a partly useful source corpus to a target topic. They work better than recent embedding transfer methods, and give benefits even with contextual embeddings. It may be of interest to extend these techniques to embed knowledge graph elements.
Acknowledgment:
Partly supported by an IBM AI Horizon grant. We thank all the anonymous reviewers for their constructive feedback.
Topic Sensitive Attention on Generic Corpora Corrects Sense Bias in Pretrained Embeddings
(Appendix)
Here we present additional results that did not fit into the main paper.
Ablation studies on SrcSel
We compare variants in the design of SrcSel in Tables 13, 14 and 15. In SrcSel:R we run SrcSel without weighting the source snippets by the score in (6). We observe that the performance is worse than with the score. Next, we check if the score would suffice in down-weighting irrelevant snippets without help from our IR-based selection. In SrcSel:c we include 5% random snippets from in addition to those in SrcSel and weigh them all by their score. We find in Table 15 that the performance drops compared to SrcSel. Thus, both the weighting and the IR selection are important components of our source selection method.
Critical hyper-parameters
The number of neighbours used for computing the embedding-based stability score, as shown in (2), is set to 10 for all the tasks. We train each of the different embedding methods for a range of different epochs: {5, 20, 80, 160, 200, 250}. The parameter of RegSense and RegFreq is tuned over {0.1, 1, 10, 50}. Pre-trained embeddings are obtained by training a CBOW model for 5 epochs on a cleaned version of the 20160901 dump of Wikipedia. All the embedding sizes are set to .
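The search space described above can be enumerated with a short sketch. The two grids are the values stated in the text; the name `reg_weight` for the tuned regularizer parameter and the scoring callback `evaluate` are hypothetical, since the parameter's symbol and the authors' tuning code are not given here.

```python
from itertools import product
from typing import Callable, Sequence, Tuple

# Grids as stated in the text above.
EPOCH_GRID = (5, 20, 80, 160, 200, 250)
REG_WEIGHT_GRID = (0.1, 1, 10, 50)  # `reg_weight` is a hypothetical name


def grid_search(
    evaluate: Callable[[int, float], float],
    epoch_grid: Sequence[int] = EPOCH_GRID,
    reg_weight_grid: Sequence[float] = REG_WEIGHT_GRID,
) -> Tuple[int, float]:
    """Return the (epochs, reg_weight) pair with the highest dev score.

    `evaluate` is a hypothetical stand-in for training with the given
    setting and returning a score on a held-out dev set.
    """
    return max(product(epoch_grid, reg_weight_grid), key=lambda cfg: evaluate(*cfg))


def toy_dev_score(epochs: int, reg_weight: float) -> float:
    """Made-up score that peaks at (80, 10); not a result from the paper."""
    return -abs(epochs - 80) - abs(reg_weight - 10)


best_setting = grid_search(toy_dev_score)
num_settings = len(list(product(EPOCH_GRID, REG_WEIGHT_GRID)))
```

With six epoch budgets and four regularizer weights, each regularized method is evaluated on 24 settings per topic; methods without a regularizer only need the epoch grid.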
References
- Baeza-Yates and Ribeiro-Neto (1999) Ricardo A. Baeza-Yates and Berthier Ribeiro-Neto. 1999. Modern Information Retrieval. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA.
- Barnes et al. (2018) Jeremy Barnes, Roman Klinger, and Sabine Schulte im Walde. 2018. Projecting embeddings for domain adaptation: Joint modeling of sentiment analysis in diverse domains. In COLING.
- Blitzer et al. (2006) John Blitzer, Ryan McDonald, and Fernando Pereira. 2006. Domain adaptation with structural correspondence learning. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 120–128. Association for Computational Linguistics.
- Bollegala and Bao (2018) Danushka Bollegala and Cong Bao. 2018. Learning word meta-embeddings by autoencoding. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1650–1661.
- Bollegala et al. (2015) Danushka Bollegala, Takanori Maehara, and Ken-ichi Kawarabayashi. 2015. Unsupervised cross-domain word representation learning. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL.
- Bollegala et al. (2014) Danushka Bollegala, David J. Weir, and John A. Carroll. 2014. Learning to predict distributions of words across domains. In ACL.
- Cer et al. (2018) Daniel Cer et al. 2018. Universal sentence encoder. CoRR, abs/1803.11175.
- Coates and Bollegala (2018) Joshua Coates and Danushka Bollegala. 2018. Frustratingly easy meta-embedding - computing meta-embeddings by averaging source word embeddings. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 2 (Short Papers), pages 194–198.
