TL;DR
This paper systematically compares methods for representing noun compounds, finding that composition functions generally outperform distributional representations and that a joint training objective may yield further improvements.
Contribution
It provides a comprehensive comparison of noun compound representations, highlighting the effectiveness of composition functions and suggesting joint training for improved performance.
Findings
Composition functions outperform distributional representations in most cases.
Among composition functions, those with more computational power and parameters generally produce higher-quality representations.
No single function is best for all scenarios, indicating potential for joint training.
| syndicate representative (rare) | |||
|---|---|---|---|
| Distributional | |||
| geloios | |||
| t.franse | |||
| adopter(s | |||
| ahchie | |||
| anquish | |||
| Compositional | |||
| Add | FullAdd | Matrix | LSTM |
| syndicate | syndicate | f(student, representative) | f(worker, representative) |
| representative | f(deputy, representative) | syndicate | f(player, representative) |
| f(worker, representative) | f(student, representative) | f(deputy, representative) | f(crack, dealer) |
| f(deputy, representative) | f(player, representative) | f(worker, representative) | f(company, spokesman) |
| f(student, representative) | f(worker, representative) | f(player, representative) | f(industry, commissioner) |
| Paraphrase-based | |||
| Co-occurrence | Backtranslation | ||
| f(company, representative) | f(worker, representative) | ||
| f(phone, representative) | f(union, representative) | ||
| f(union, representative) | f(group, manager) | ||
| f(marketing, representative) | f(employee, representative) | ||
| f(labor, representative) | f(student, representative) | ||
| army officer (frequent) | |||
| Distributional | |||
| army_captain | |||
| army_major | |||
| navy_officer | |||
| army_general | |||
| army_lieutenant | |||
| Compositional | |||
| Add | FullAdd | Matrix | LSTM |
| army | f(police, commander) | f(police, commander) | f(militia, commander) |
| officer | f(army, troop) | army_officer | f(police, commander) |
| f(army, battalion) | f(militia, commander) | f(army, troop) | f(opposition, commander) |
| f(army, troop) | f(army, camp) | army_general | f(military, official) |
| f(army, building) | army_officer | f(army, camp) | f(comrade, commander) |
| Paraphrase-based | |||
| Co-occurrence | Backtranslation | ||
| f(patrol, officer) | f(army, official) | ||
| f(navy, officer) | f(military, spokesman) | ||
| f(prison, officer) | f(army, lieutenant) | ||
| f(fire, officer) | f(army, chief) | ||
| f(police, officer) | f(army, spokesman) | ||
| Representation | Used for transportation | Is a weapon | Is round | Has various colors | Made of metal |
|---|---|---|---|---|---|
| Distributional | |||||
| Add | |||||
| FullAdd | |||||
| Matrix | |||||
| LSTM | |||||
| Co-occurrence | |||||
| Backtranslation |
| Feature | Representation | Embedding | Window | Dimension | Precision | Recall | |
|---|---|---|---|---|---|---|---|
| Used for transportation | Co-occurrence | word2vec SG | 10 | 300 | |||
| Is a weapon | Backtranslation | word2vec CBOW | 2 | 300 | |||
| Is round | Co-occurrence | word2vec CBOW | 10 | 300 | |||
| Has various colors | Co-occurrence | GloVe | 2 | 200 | |||
| Made of metal | Matrix | word2vec SG | 5 | 300 |
| Representation | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Distributional | ||||||||||||
| Add | ||||||||||||
| FullAdd | ||||||||||||
| Matrix | ||||||||||||
| LSTM | ||||||||||||
| Co-occurrence | ||||||||||||
| Backtranslation |
| Dataset | Representation | Embedding | Window | Dimension | Precision | Recall | |
|---|---|---|---|---|---|---|---|
| Coarse-grained Random | LSTM | Fasttext SG | 2 | 300 | |||
| Coarse-grained Lexical | LSTM | Fasttext SG | 2 | 200 | |||
| Fine-grained Random | LSTM | Fasttext SG | 2 | 300 | |||
| Fine-grained Lexical | Matrix | word2vec SG | 2 | 100 |
| cause | ||
| experiencer-of-experience | company strategy | |
| purpose | ||
| purpose | labor market | |
| create-provide-generate-sell | aid center | |
| mitigate&oppose | fishing quota | |
| perform&engage_in | acquisition fund | |
| organize&supervise&authority | fire commissioner | |
| time | ||
| time-of1 | fourth-quarter income | |
| time-of2 | rating period | |
| loc_part_whole | ||
| location | water spider | |
| whole+part_or_member_of | society member | |
| attribute | ||
| equative | winter season | |
| adj-like_noun | core tradition | |
| partial_attribute_transfer | lemon soda | |
| other | ||
| measure | percentage change | |
| lexicalized | action hero | |
| other | trade conflict | |
| objective | ||
| objective | biotechnology research | |
| causal | ||
| subject | government figure | |
| justification | genocide trial | |
| creator-provider-cause_of | refining margin | |
| means | car bombing | |
| complement | ||
| relational-noun-complement | police power | |
| whole+attribute&feature&quality_value_is_characteristic_of | earth tone | |
| containment | ||
| part&member_of_collection&config&series | stock portfolio | |
| contain | studio lot | |
| variety&genus_of | tuberculosis strain | |
| amount-of | work load | |
| substance-material-ingredient | cedar chalet | |
| owner_emp_use | ||
| user_recipient | subway platform | |
| employer | government technocrat | |
| owner-user | government surplus | |
| topical | ||
| personal_name | Sarah Boyle | |
| topic_of_cognition&emotion | security fear | |
| topic_of_expert | cancer expert | |
| obtain&access&seek | finance plan | |
| personal_title | Minister Kennedy | |
| topic | property deal | |
A Systematic Comparison of English Noun Compound Representations
Vered Shwartz
Computer Science Department, Bar-Ilan University, Ramat-Gan, Israel
Abstract
Building meaningful representations of noun compounds is not trivial since many of them scarcely appear in the corpus. To that end, composition functions approximate the distributional representation of a noun compound by combining its constituent distributional vectors. In the more general case, phrase embeddings have been trained by minimizing the distance between the vectors representing paraphrases. We compare various types of noun compound representations, including distributional, compositional, and paraphrase-based representations, through a series of tasks and analyses, and with an extensive number of underlying word embeddings. We find that indeed, in most cases, composition functions produce higher quality representations than distributional ones, and they improve with computational power. No single function performs best in all scenarios, suggesting that a joint training objective may produce improved representations.
1 Introduction
The simplest way to obtain a vector representation for a multiword term is to treat it as a single token, e.g. by replacing spaces with underscores, and train a standard word embedding algorithm. This is typically done for common n-grams, which often include named entities (e.g. New York), but in theory can also be based on syntactic criteria, for instance in order to learn noun compound vectors. The main issue with this approach is that word embedding algorithms require sufficient term frequency to obtain meaningful representations, and many noun compounds rarely occur in text corpora (Kim and Baldwin, 2006).
To overcome the sparsity issue, it is common to learn a composition function which computes a noun compound vector from its constituents’ distributional representations, e.g. vec(cost estimate) = f(vec(cost), vec(estimate)). Various functions have been proposed in the literature, typically based on vector arithmetics (e.g. Mitchell and Lapata, 2010; Zanzotto et al., 2010; Dinu et al., 2013). Such functions are learned with the objective of minimizing the distance between the observed (distributional) vector and the composed vector of each noun compound, and most functions are limited to binary noun compounds.
A parallel line of work computes phrase embeddings for variable-length phrases, by adapting the word embedding training objective (Poliak et al., 2017) or by minimizing the distance between the representations of paraphrases (Wieting et al., 2016; Wieting and Gimpel, 2017; Wieting et al., 2017). Paraphrase-based phrase embeddings require a large number of paraphrases as training instances. Such paraphrases are often generated by translating an English phrase into a foreign language and back to English, considering variations in translation as paraphrases. This technique is referred to as “bilingual pivoting” or “backtranslation” (Barzilay and McKeown, 2001; Bannard and Callison-Burch, 2005; Ganitkevitch et al., 2013; Mallinson et al., 2017).
In this work we test the quality of noun compound representations produced by different methods, including distributional representations, composition functions, and paraphrase-based phrase embeddings. We extend the work of Dima (2016), who evaluated various composition functions on the noun compound relation classification task, in several aspects. First, we test a broader range of representations, which may differ both in their architectures and in their training objectives. Second, we train each representation with a wide variety of underlying word embeddings, and analyze the representation’s behaviour across the different word embeddings. Finally, we use several tasks to evaluate the representation quality: relation classification (what is the relationship between the constituents?), property classification (is a cheese wheel round?), as well as a qualitative and quantitative analysis of the nearest neighbours. The results confirm that the distributional representations of rare noun compounds are indeed of low quality. Across representations, the nearest neighbours of a target noun compound vector typically include many trivial similarities such as other noun compounds with a shared constituent.
Among the composition functions, functions with more computational power and parameters generally produced higher quality representations. The paraphrase-based functions outperformed the others in the property prediction task, while the compositional functions performed better on relation classification. The results suggest that learning a composition function with a combined training objective is a promising research direction that may result in improved noun compound representations. The code and data are available at https://github.com/vered1986/NC_Embeddings.
2 Representations
We trained 315 distributional semantic models (DSMs) that differ by their training objective (Section 2.1) and the underlying embeddings used for the constituent nouns (Section 2.2).
2.1 Training Objective
Distributional.
This approach simply treats a noun compound as a single token w1_w2, and learns standard word embeddings for the words and noun compounds in the corpus.
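The single-token treatment can be illustrated with a small preprocessing sketch. The function name `merge_compounds`, the example sentence, and the set of known compounds are hypothetical; the paper does not describe its preprocessing code, so this only shows one plausible way to produce `w1_w2` tokens before training a standard word embedding algorithm.

```python
from typing import List, Set, Tuple


def merge_compounds(tokens: List[str], compounds: Set[Tuple[str, str]]) -> List[str]:
    """Replace each adjacent pair (w1, w2) listed in `compounds` with the token 'w1_w2'."""
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in compounds:
            # Known noun compound: emit a single underscored token and skip both words.
            merged.append("_".join(pair))
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical example input.
sentence = "the army officer met a syndicate representative".split()
known = {("army", "officer"), ("syndicate", "representative")}
print(merge_compounds(sentence, known))
```

Under this scheme a compound only receives a useful vector if the merged token is frequent enough in the corpus, which is the sparsity issue the compositional approaches try to address.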
Compositional.
We learn a function $f(\cdot)$ which, for a given noun compound $w_1 w_2$, operates on the word embeddings of its constituent nouns, $\vec{v}_{w_1}$ and $\vec{v}_{w_2}$, and returns a vector representing the compound. Following Dima (2016) and earlier work, the training objective is to minimize the distance between the observed distributional embedding $\vec{v}_{w_1\_w_2}$ and the composed vector $f(w_1, w_2)$.
We train the following composition functions:
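- Add (Mitchell and Lapata, 2010): $f(w_1, w_2) = \alpha \vec{v}_{w_1} + \beta \vec{v}_{w_2}$, where $\alpha, \beta$ are scalars.
- FullAdd (Zanzotto et al., 2010; Dinu et al., 2013): $f(w_1, w_2) = A \vec{v}_{w_1} + B \vec{v}_{w_2}$, where $A, B \in \mathbb{R}^{d \times d}$ are matrices and $d$ is the embedding dimension.
- Matrix (Dima, 2016): $f(w_1, w_2) = \tanh(W [\vec{v}_{w_1}; \vec{v}_{w_2}])$, where $W \in \mathbb{R}^{d \times 2d}$. This is the application of the recursive matrix-vector method of Socher et al. (2012) to binary phrases. (Originally, this method was trained with an extrinsic training objective of sentiment analysis.)
- LSTM: encoding the compound with a long short-term memory network (LSTM; Hochreiter and Schmidhuber, 1997): $f(w_1, w_2) = \mathrm{LSTM}(\vec{v}_{w_1}, \vec{v}_{w_2})$.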
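The first three composition functions can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the paper's implementation (which uses AllenNLP and PyTorch): the helper names are hypothetical, the parameters are fixed by hand rather than learned, the `tanh` non-linearity and the $d \times 2d$ shape of `W` in `matrix` follow the common formulation of this model, and squared Euclidean distance stands in for the unspecified distance in the training objective. The LSTM variant is omitted because it needs a neural network library.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def add(w1: Vector, w2: Vector, alpha: float, beta: float) -> Vector:
    """Add: weighted sum of the constituent vectors with scalar weights."""
    return [alpha * a + beta * b for a, b in zip(w1, w2)]


def full_add(w1: Vector, w2: Vector, A: Matrix, B: Matrix) -> Vector:
    """FullAdd: each constituent is transformed by its own d x d matrix, then summed."""
    return [a + b for a, b in zip(matvec(A, w1), matvec(B, w2))]


def matrix(w1: Vector, w2: Vector, W: Matrix) -> Vector:
    """Matrix: tanh of a d x 2d matrix applied to the concatenation [w1; w2] (assumed form)."""
    return [math.tanh(x) for x in matvec(W, w1 + w2)]


def squared_distance(u: Vector, v: Vector) -> float:
    """Assumed distance between the observed and the composed vector."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


# Toy 2-dimensional example with hand-set (not learned) parameters.
w1, w2 = [1.0, 0.0], [0.0, 1.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
observed = [0.5, 0.5]  # stands in for the distributional vector of w1_w2

composed = add(w1, w2, alpha=0.5, beta=0.5)
print(composed, squared_distance(observed, composed))
print(full_add(w1, w2, identity, identity))
print(matrix(w1, w2, [[0.0] * 4, [0.0] * 4]))
```

In training, the scalars and matrices would be updated to reduce the distance between the composed and observed vectors over all training compounds; the number of free parameters grows from two (Add) to $2d^2$ (FullAdd and Matrix).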
Paraphrase-based.
In this approach we follow the literature of paraphrase-based phrase embeddings (e.g. Wieting et al., 2016, 2017). We generate paraphrases for each noun compound, and train the function with the objective of producing similar vectors to the noun compound and its paraphrase.
To obtain the representation of a phrase (either a noun compound or its variable-length paraphrase), we encode it with an LSTM. For a given noun compound $NC = w_1 w_2$ and its paraphrase $p$, we set the loss to:

$$\mathcal{L} = \max\left(0,\ \delta - \cos(\vec{v}_{NC}, \vec{v}_{p}) + \cos(\vec{v}_{NC}, \vec{v}_{p'})\right)$$

where $\vec{v}_x$ is the encoding of phrase $x$, $p'$ is a negative-sampled paraphrase, and the margin $\delta$ was set to 0.6 based on its value in Wieting et al. (2016). The following approaches were used to obtain the paraphrases:
- Backtranslation: We translate each noun compound to foreign language(s) and back to English, as in Wieting et al. (2017). Specifically, we use the DeepL Translator web interface (https://www.deepl.com), performing translation from English to 4 different foreign languages (French, Italian, Spanish, and Romanian) and back to English. We focused on Romance languages because they translate English noun compounds to noun phrases with prepositions (Girju, 2007), and we were hoping that this would drive the backtranslation to be more explicit. For example, baby oil is translated in French to huile pour bébé, which literally means oil for baby. In practice, translating back to English mostly generates paraphrases which are other noun compounds (synonyms or related terms), rather than prepositional paraphrases.
  We use all the suggested translations to generate a large list of paraphrases for each noun compound, but we apply two filters. First, we trivially remove the noun compound itself from its list of paraphrases. Second, the translation sometimes yields non-English phrases (a result of an error in the translation), which we automatically identify and remove using a language identification tool (https://pypi.org/project/guess_language-spirit/). After filtering out around half of the paraphrases, we are left with an average of 6.71 paraphrases per compound.
- Co-occurrence: We treat the frequent joint occurrences of w1 and w2 in a corpus as paraphrases, e.g. apple cake may yield a paraphrase like “cake made of apples”. Specifically, we use the paraphrases obtained by Shwartz and Dagan (2018) from the Google N-gram corpus (Brants and Franz, 2006). The paraphrases are of variable length (3-5 words), and have been pre-processed to remove punctuation, adjectives, adverbs and determiners. The average number of paraphrases per compound is 9.18.
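The paraphrase-based objective described in this section, with a margin of 0.6 and one negative-sampled paraphrase, can be sketched as a hinge loss over phrase encodings. This is a minimal sketch under assumptions: cosine similarity is used as the similarity function, following the usual formulation in Wieting et al. (2016), and the toy vectors stand in for LSTM encodings. The function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def paraphrase_margin_loss(nc: Vector, pos: Vector, neg: Vector, margin: float = 0.6) -> float:
    """Hinge loss: the compound should be closer to its paraphrase than to the negative
    sample by at least `margin` (similarity measure is an assumption)."""
    return max(0.0, margin - cosine(nc, pos) + cosine(nc, neg))


# Toy encodings (stand-ins for LSTM outputs).
nc_vec = [1.0, 0.0]
good_paraphrase = [1.0, 0.0]
unrelated = [0.0, 1.0]

# Paraphrase already much closer than the negative sample: no loss.
print(paraphrase_margin_loss(nc_vec, good_paraphrase, unrelated))
# Roles swapped: the negative sample is closer, so the loss is positive.
print(paraphrase_margin_loss(nc_vec, unrelated, good_paraphrase))
```

The loss is zero once the paraphrase is more similar to the compound than the negative sample by at least the margin, so training only pushes on pairs that violate that ordering.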
2.2 Constituent Word Embeddings
To represent the constituent words, we trained various word embedding algorithms: word2vec (Mikolov et al., 2013) and fastText (Bojanowski et al., 2017), which extends word2vec by adding subword information. We used both the Skip-Gram objective (which predicts the context words given the target word) and the CBOW objective (continuous bag-of-words, predicting the target word from its context). (We used the Gensim implementation: https://radimrehurek.com/gensim/) We also trained the GloVe algorithm (Pennington et al., 2014), which estimates the log-probability of a word pair co-occurrence. All the embeddings were trained on the English Wikipedia dump from January 2018, with various values for the window size (2, 5, 10) and the embedding dimension (100, 200, 300).
2.3 Implementation Details
We implemented the models using the AllenNLP library (Gardner et al., 2018), which is based on the PyTorch framework (Paszke et al., 2017). To train the DSMs we used the list of 18,856 compositional noun compounds from Tratz (2011), omitting 351 noun compounds belonging to the lexicalized, personal_name, and personal_title classes. We only used binary noun compounds, i.e. consisting of exactly two constituent nouns, and we split them into 80% train, 10% test, and 10% validation sets.
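An 80/10/10 split of this kind can be sketched as follows. The text does not say how compounds were assigned to each set, so the seeded random shuffle here is an assumption, and `split_compounds` and the placeholder compound names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def split_compounds(compounds: Sequence[str], seed: int = 0) -> Tuple[List[str], List[str], List[str]]:
    """Shuffle deterministically and split into 80% train, 10% test, 10% validation."""
    items = sorted(compounds)           # fixed starting order for reproducibility
    random.Random(seed).shuffle(items)  # assumed: a random split
    n_train = len(items) * 8 // 10
    n_test = len(items) // 10
    train = items[:n_train]
    test = items[n_train:n_train + n_test]
    validation = items[n_train + n_test:]  # remainder, so no compound is dropped
    return train, test, validation


# Placeholder compound list.
toy = [f"noun{i} head{i}" for i in range(100)]
train, test, validation = split_compounds(toy)
print(len(train), len(test), len(validation))
```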
For the sake of simplicity, for the remainder of the paper we will refer to the training objective and architecture combination as the “representation”, and a trained instance of the representation, with a choice of underlying word embeddings (algorithm, dimension, and window), as a DSM.
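The total of 315 DSMs is consistent with crossing the seven representations with every embedding configuration listed above. The breakdown below is inferred from those settings rather than stated explicitly in the text, and the label strings are only illustrative.

```python
from itertools import product

# Seven representations (training objective and architecture).
REPRESENTATIONS = [
    "Distributional", "Add", "FullAdd", "Matrix", "LSTM",
    "Co-occurrence", "Backtranslation",
]
# Five embedding algorithms, three window sizes, three dimensions.
EMBEDDINGS = ["word2vec SG", "word2vec CBOW", "fastText SG", "fastText CBOW", "GloVe"]
WINDOWS = [2, 5, 10]
DIMENSIONS = [100, 200, 300]

# Each DSM is one representation trained on one embedding configuration.
dsms = list(product(REPRESENTATIONS, EMBEDDINGS, WINDOWS, DIMENSIONS))
print(len(dsms))  # 7 * 5 * 3 * 3 = 315
```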
3 Experiments
We compare the various representations in 3 experiments: an analysis of the nearest neighbours of each noun compound vector (Section 3.1), an evaluation on property prediction (Section 3.2), and an evaluation on noun compound relation classification (Section 3.3).
3.1 Nearest Neighbour Analysis
References
- Colin Bannard and Chris Callison-Burch. 2005. Paraphrasing with bilingual parallel corpora. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), pages 597–604, Ann Arbor, Michigan. Association for Computational Linguistics.
- Regina Barzilay and Kathleen R. McKeown. 2001. Extracting paraphrases from a parallel corpus. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics.
- Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.
- Gemma Boleda, Marco Baroni, The Nghia Pham, and Louise McNally. 2013. Intensionality was only alleged: On adjective-noun composition in distributional semantics. In Proceedings of the 10th International Conference on Computational Semantics (IWCS 2013) – Long Papers, pages 35–46, Potsdam, Germany. Association for Computational Linguistics.
- Thorsten Brants and Alex Franz. 2006. Web 1T 5-gram version 1.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Minneapolis, Minnesota. Association for Computational Linguistics.
- Corina Dima. 2016. On the compositionality and semantic interpretation of English noun compounds. In Proceedings of the 1st Workshop on Representation Learning for NLP, pages 27–39. Association for Computational Linguistics.
- Georgiana Dinu, Nghia The Pham, and Marco Baroni. 2013. General estimation and evaluation of compositional distributional semantic models. In Proceedings of the Workshop on Continuous Vector Space Models and their Compositionality, pages 50–58, Sofia, Bulgaria. Association for Computational Linguistics.
