A Systematic Comparison of English Noun Compound Representations

Vered Shwartz

arXiv:1906.04772·cs.CL·June 13, 2019

TL;DR

This paper systematically compares different noun compound representation methods, finding that compositional functions generally outperform distributional ones and that combining approaches could yield better results.

Contribution

It provides a comprehensive comparison of noun compound representations, highlighting the effectiveness of composition functions and suggesting joint training for improved performance.

Findings

1. Composition functions outperform distributional representations in most cases.

2. Among composition functions, those with more computational power and parameters generally produce higher-quality representations.

3. No single function is best in all scenarios, which suggests that a joint training objective may produce improved representations.

Tables

Table 1: Top 5 nearest neighbours of two example noun compounds, syndicate representative (1 corpus occurrence) and army officer (13,924 occurrences), in each composition function. DSM = (word2vec SG, window 5, 300d).

syndicate representative (rare)
Distributional
geloios
t.franse
adopter(s
ahchie
anquish
Compositional
Add	FullAdd	Matrix	LSTM
syndicate	syndicate	f(student, representative)	f(worker, representative)
representative	f(deputy, representative)	syndicate	f(player, representative)
f(worker, representative)	f(student, representative)	f(deputy, representative)	f(crack, dealer)
f(deputy, representative)	f(player, representative)	f(worker, representative)	f(company, spokesman)
f(student, representative)	f(worker, representative)	f(player, representative)	f(industry, commissioner)
Paraphrase-based
Co-occurrence	Backtranslation
f(company, representative)	f(worker, representative)
f(phone, representative)	f(union, representative)
f(union, representative)	f(group, manager)
f(marketing, representative)	f(employee, representative)
f(labor, representative)	f(student, representative)
army officer (frequent)
Distributional
army_captain
army_major
navy_officer
army_general
army_lieutenant
Compositional
Add	FullAdd	Matrix	LSTM
army	f(police, commander)	f(police, commander)	f(militia, commander)
officer	f(army, troop)	army_officer	f(police, commander)
f(army, battalion)	f(militia, commander)	f(army, troop)	f(opposition, commander)
f(army, troop)	f(army, camp)	army_general	f(military, official)
f(army, building)	army_officer	f(army, camp)	f(comrade, commander)
Paraphrase-based
Co-occurrence	Backtranslation
f(patrol, officer)	f(army, official)
f(navy, officer)	f(military, spokesman)
f(prison, officer)	f(army, lieutenant)
f(fire, officer)	f(army, chief)
f(police, officer)	f(army, spokesman)

Table 2: Mean and standard deviation of $F_1$ scores across DSMs, for each representation and property combination. The majority baseline $F_1$ score is 0 for all properties, since it always predicts False.

Representation	Used for transportation	Is a weapon	Is round	Has various colors	Made of metal
Distributional	$48.0 \pm 12.6$	$57.3 \pm 14.8$	$24.8 \pm 8.9$	$42.0 \pm 12.5$	$41.3 \pm 12.0$
Add	$55.8 \pm 13.5$	$30.3 \pm 20.1$	$46.2 \pm 13.2$	$41.8 \pm 13.1$	$55.1 \pm 14.1$
FullAdd	$55.9 \pm 13.4$	$36.8 \pm 17.3$	$44.0 \pm 13.0$	$48.2 \pm 12.7$	$52.2 \pm 13.0$
Matrix	$56.5 \pm 13.9$	$24.0 \pm 19.1$	$43.8 \pm 13.4$	$49.5 \pm 13.3$	$52.0 \pm 12.9$
LSTM	$48.3 \pm 15.8$	$0.0 \pm 0.0$	$21.7 \pm 17.5$	$37.2 \pm 18.4$	$42.1 \pm 18.6$
Co-occurrence	$64.2 \pm 14.9$	$40.5 \pm 30.1$	$47.0 \pm 13.0$	$56.9 \pm 12.8$	$57.6 \pm 12.9$
Backtranslation	$58.3 \pm 14.1$	$54.0 \pm 19.5$	$42.1 \pm 13.5$	$52.4 \pm 13.5$	$57.4 \pm 13.1$

Table 3: The performance of the best setting for each property.

Feature	Representation	Embedding	Window	Dimension	Precision	Recall	$F_1$
Used for transportation	Co-occurrence	word2vec SG	10	300	$74.5$	$78.8$	$76.6$
Is a weapon	Backtranslation	word2vec CBOW	2	300	$71.4$	$88.2$	$78.9$
Is round	Co-occurrence	word2vec CBOW	10	300	$56.2$	$87.1$	$68.4$
Has various colors	Co-occurrence	GloVe	2	200	$70.6$	$76.6$	$73.5$
Made of metal	Matrix	word2vec SG	5	300	$78.6$	$61.1$	$68.8$
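
As a quick sanity check on the Precision, Recall and $F_1$ columns, the sketch below computes $F_1$ as the harmonic mean of precision and recall. This is the standard definition and is assumed here rather than quoted from the paper; the function name `f1_score` is illustrative. Under that assumption, the rows of Table 3 are reproduced to within about 0.1, which is consistent with rounding in the reported precision and recall.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall taken from the "Used for transportation" row of Table 3.
print(round(f1_score(74.5, 78.8), 1))  # 76.6

# A baseline that always predicts False has no true positives, so its
# precision and recall are both treated as 0 here, giving an F1 of 0.
print(f1_score(0.0, 0.0))  # 0.0
```

The second call mirrors the remark in the Table 2 caption that the majority baseline scores 0 on every property.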

Table 4: Mean and standard deviation of $F_1$ scores across word embeddings, windows and dimensions, for each composition function and dataset combination.

Representation	Coarse-grained Random	Coarse-grained Lexical	Fine-grained Random	Fine-grained Lexical
Distributional	$44.0 \pm 11.5$	$30.5 \pm 8.5$	$40.8 \pm 12.5$	$24.7 \pm 6.5$
Add	$51.9 \pm 10.5$	$34.7 \pm 7.3$	$51.5 \pm 10.9$	$30.7 \pm 5.9$
FullAdd	$54.5 \pm 10.7$	$35.7 \pm 8.0$	$53.5 \pm 11.0$	$28.8 \pm 6.8$
Matrix	$49.1 \pm 11.3$	$32.6 \pm 8.1$	$47.3 \pm 12.1$	$26.7 \pm 7.2$
LSTM	$54.0 \pm 11.8$	$37.5 \pm 8.2$	$52.1 \pm 11.9$	$30.9 \pm 6.6$
Co-occurrence	$49.8 \pm 9.7$	$31.4 \pm 7.1$	$47.7 \pm 10.6$	$24.6 \pm 6.0$
Backtranslation	$47.2 \pm 7.7$	$33.5 \pm 6.1$	$44.6 \pm 8.5$	$26.7 \pm 5.1$

Table 5: The performance of the best setting for each noun compound relation classification dataset.

Dataset	Representation	Embedding	Window	Dimension	Precision	Recall	$F_1$
Coarse-grained Random	LSTM	Fasttext SG	2	300	$66.5$	$66.7$	$66.2$
Coarse-grained Lexical	LSTM	Fasttext SG	2	200	$50.2$	$49.0$	$47.5$
Fine-grained Random	LSTM	Fasttext SG	2	300	$64.6$	$65.3$	$63.9$
Fine-grained Lexical	Matrix	word2vec SG	2	100	$39.6$	$39.8$	$38.1$

cause
	experiencer-of-experience	company strategy
purpose
	purpose	labor market
	create-provide-generate-sell	aid center
	mitigate&oppose	fishing quota
	perform&engage_in	acquisition fund
	organize&supervise&authority	fire commissioner
time
	time-of1	fourth-quarter income
	time-of2	rating period
loc_part_whole
	location	water spider
	whole+part_or_member_of	society member
attribute
	equative	winter season
	adj-like_noun	core tradition
	partial_attribute_transfer	lemon soda
other
	measure	percentage change
	lexicalized	action hero
	other	trade conflict
objective
	objective	biotechnology research
causal
	subject	government figure
	justification	genocide trial
	creator-provider-cause_of	refining margin
	means	car bombing
complement
	relational-noun-complement	police power
	whole+attribute&feature&quality_value_is_characteristic_of	earth tone
containment
	part&member_of_collection&config&series	stock portfolio
	contain	studio lot
	variety&genus_of	tuberculosis strain
	amount-of	work load
	substance-material-ingredient	cedar chalet
owner_emp_use
	user_recipient	subway platform
	employer	government technocrat
	owner-user	government surplus
topical
	personal_name	Sarah Boyle
	topic_of_cognition&emotion	security fear
	topic_of_expert	cancer expert
	obtain&access&seek	finance plan
	personal_title	Minister Kennedy
	topic	property deal

Equations

$$\max(0, \lambda - \cos(v_{NC}, v_{p}) + \cos(v_{NC}, v_{p'}))$$
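
The margin loss $\max(0, \lambda - \cos(v_{NC}, v_{p}) + \cos(v_{NC}, v_{p'}))$ can be illustrated with a minimal sketch in plain Python. It assumes that the three vectors are already computed (in the paper they are LSTM encodings) and that none of them is the zero vector; the function names `cosine` and `paraphrase_hinge_loss` are illustrative, and the default margin of 0.6 is the value of $\lambda$ reported in the paper's Section 2.1.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def paraphrase_hinge_loss(
    v_nc: Sequence[float],
    v_p: Sequence[float],
    v_neg: Sequence[float],
    margin: float = 0.6,
) -> float:
    """max(0, margin - cos(v_nc, v_p) + cos(v_nc, v_neg))."""
    return max(0.0, margin - cosine(v_nc, v_p) + cosine(v_nc, v_neg))


# Toy 2-dimensional vectors, chosen by hand for illustration only.
v_nc = [1.0, 0.0]
v_paraphrase = [1.0, 0.0]   # identical direction: cosine 1
v_negative = [0.0, 1.0]     # orthogonal: cosine 0

# The true paraphrase is closer than the negative by more than the margin.
print(paraphrase_hinge_loss(v_nc, v_paraphrase, v_negative))  # 0.0

# Swapping the roles gives a positive loss (roughly 1.6).
print(paraphrase_hinge_loss(v_nc, v_negative, v_paraphrase))
```

The loss is zero once the paraphrase is more similar to the compound than the negative sample by at least the margin, so only violating triples contribute to training.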

Full text

Computer Science Department, Bar-Ilan University, Ramat-Gan, Israel

[email protected]

Abstract

Building meaningful representations of noun compounds is not trivial since many of them scarcely appear in the corpus. To that end, composition functions approximate the distributional representation of a noun compound by combining its constituent distributional vectors. In the more general case, phrase embeddings have been trained by minimizing the distance between the vectors representing paraphrases. We compare various types of noun compound representations, including distributional, compositional, and paraphrase-based representations, through a series of tasks and analyses, and with an extensive number of underlying word embeddings. We find that indeed, in most cases, composition functions produce higher quality representations than distributional ones, and they improve with computational power. No single function performs best in all scenarios, suggesting that a joint training objective may produce improved representations.

1 Introduction

The simplest way to obtain a vector representation for a multiword term is to treat it as a single token, e.g. by replacing spaces with underscores, and train a standard word embedding algorithm. This is typically done for common n-grams, which often include named entities (e.g. New York), but in theory can also be based on syntactic criteria, for instance in order to learn noun compound vectors. The main issue with this approach is that word embedding algorithms require sufficient term frequency to obtain meaningful representations, and many noun compounds rarely occur in text corpora (Kim and Baldwin, 2006).

To overcome the sparsity issue, it is common to learn a composition function which computes a noun compound vector from its constituents’ distributional representations, e.g. vec(cost estimate) = f(vec(cost), vec(estimate)). Various functions have been proposed in the literature, typically based on vector arithmetic (e.g. Mitchell and Lapata, 2010; Zanzotto et al., 2010; Dinu et al., 2013). Such functions are learned with the objective of minimizing the distance between the observed (distributional) vector and the composed vector of each noun compound, and most functions are limited to binary noun compounds.

A parallel line of work computes phrase embeddings for variable-length phrases, by adapting the word embedding training objective (Poliak et al., 2017) or by minimizing the distance between the representations of paraphrases (Wieting et al., 2016; Wieting and Gimpel, 2017; Wieting et al., 2017). Paraphrase-based phrase embeddings require a large number of paraphrases as training instances. Such paraphrases are often generated by translating an English phrase into a foreign language and back to English, considering variations in translation as paraphrases. This technique is referred to as “bilingual pivoting” or “backtranslation” (Barzilay and McKeown, 2001; Bannard and Callison-Burch, 2005; Ganitkevitch et al., 2013; Mallinson et al., 2017).

In this work we test the quality of noun compound representations produced by different methods, including distributional representations, composition functions, and paraphrase-based phrase embeddings. We extend the work of Dima (2016), who evaluated various composition functions on the noun compound relation classification task, in several aspects. First, we test a broader range of representations, which may differ both in their architectures and in their training objectives. Second, we train each representation with a wide variety of underlying word embeddings, and analyze the representation’s behaviour across the different word embeddings. Finally, we use several tasks to evaluate the representation quality: relation classification (what is the relationship between the constituents?), property classification (is a cheese wheel round?), as well as a qualitative and quantitative analysis of the nearest neighbours. The results confirm that the distributional representations of rare noun compounds are indeed of low quality. Across representations, the nearest neighbours of a target noun compound vector typically include many trivial similarities such as other noun compounds with a shared constituent.

Among the composition functions, functions with more computational power and parameters generally produced higher quality representations. The paraphrase-based functions outperformed the others in the property prediction task, while the compositional functions performed better on relation classification. The results suggest that learning a composition function with a combined training objective is a promising research direction that may result in improved noun compound representations. (The code and data are available at https://github.com/vered1986/NC_Embeddings.)

2 Representations

We trained 315 distributional semantic models (DSMs) that differ by their training objective (Section 2.1) and the underlying embeddings used for the constituent nouns (Section 2.2).

2.1 Training Objective

Distributional.

This approach simply treats a noun compound as a single token w1_w2, and learns standard word embeddings for the words and noun compounds in the corpus.
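
A minimal sketch of the single-token preprocessing step, assuming a pre-tokenized sentence and a known set of two-word compounds. The function name `merge_compounds` and the example sentence are illustrative and not taken from the paper's code; the merged corpus would then be passed to a standard word embedding trainer.

```python
from typing import List, Set, Tuple


def merge_compounds(tokens: List[str], compounds: Set[Tuple[str, str]]) -> List[str]:
    """Rewrite each known two-word compound in `tokens` as one w1_w2 token."""
    merged: List[str] = []
    i = 0
    while i < len(tokens):
        # Look one token ahead and merge when the bigram is a known compound.
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in compounds:
            merged.append(f"{tokens[i]}_{tokens[i + 1]}")
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


compounds = {("army", "officer"), ("cost", "estimate")}
sentence = "the army officer revised the cost estimate".split()
print(merge_compounds(sentence, compounds))
# ['the', 'army_officer', 'revised', 'the', 'cost_estimate']
```

A compound that never appears as an adjacent pair in the corpus gets no w1_w2 token at all, and one that appears only once gets a poorly estimated vector, which is the sparsity problem the other representations try to address.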

Compositional.

We learn a function $f(\cdot,\cdot):\mathcal{R}^{d}\times\mathcal{R}^{d}\rightarrow\mathcal{R}^{d}$ which, for a given noun compound, operates on the word embeddings of its constituent nouns, and returns a vector representing the compound. Following Dima (2016) and earlier work, the training objective is to minimize the distance between the observed distributional embedding $\vec{v}_{w_{1}\_w_{2}}$ and the composed vector $f(\vec{v}_{w_{1}},\vec{v}_{w_{2}})$ .

We train the following composition functions:

• Add (Mitchell and Lapata, 2010): $f(\vec{v}_{w_{1}},\vec{v}_{w_{2}})=\alpha\vec{v}_{w_{1}}+\beta\vec{v}_{w_{2}}$, where $\alpha,\beta$ are scalars.

• FullAdd (Zanzotto et al., 2010; Dinu et al., 2013): $f(\vec{v}_{w_{1}},\vec{v}_{w_{2}})=W_{1}\vec{v}_{w_{1}}+W_{2}\vec{v}_{w_{2}}$, where $W_{1},W_{2}\in\mathcal{R}^{d\times d}$ are matrices.

• Matrix (Dima, 2016): $f(\vec{v}_{w_{1}},\vec{v}_{w_{2}})=\tanh(W\cdot[\vec{v}_{w_{1}};\vec{v}_{w_{2}}])$, where $W\in\mathcal{R}^{2d\times d}$. This is the application of the recursive matrix-vector method of Socher et al. (2012) to binary phrases. (Originally, this method was trained with an extrinsic training objective of sentiment analysis.)

• LSTM: encoding the compound with a long short-term memory network (LSTM; Hochreiter and Schmidhuber, 1997): $f(\vec{v}_{w_{1}},\vec{v}_{w_{2}})=\mathrm{LSTM}(\vec{v}_{w_{1}},\vec{v}_{w_{2}})$.
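
The first three composition functions in the list above can be sketched in plain Python as follows. This is an illustrative sketch, not the paper's implementation (which uses AllenNLP): the helper names are hypothetical, the parameters are hand-picked rather than learned, and the LSTM variant is omitted. One assumption to note: the Matrix weight is stored here with $d$ rows and $2d$ columns so that its product with the concatenated $2d$-dimensional vector is defined.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix, given as a list of rows, by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def add_compose(v1: Vector, v2: Vector, alpha: float, beta: float) -> Vector:
    """Add: alpha * v1 + beta * v2, with scalar weights."""
    return [alpha * a + beta * b for a, b in zip(v1, v2)]


def full_add_compose(v1: Vector, v2: Vector, w1: Matrix, w2: Matrix) -> Vector:
    """FullAdd: W1 v1 + W2 v2, with one d x d matrix per constituent."""
    return [a + b for a, b in zip(matvec(w1, v1), matvec(w2, v2))]


def matrix_compose(v1: Vector, v2: Vector, w: Matrix) -> Vector:
    """Matrix: tanh(W [v1; v2]); W has d rows and 2d columns in this sketch."""
    return [math.tanh(x) for x in matvec(w, v1 + v2)]


# Toy 2-dimensional vectors and untrained, hand-picked parameters.
v_cost, v_estimate = [1.0, 0.0], [0.0, 2.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0] * 4, [0.0] * 4]

print(add_compose(v_cost, v_estimate, alpha=0.5, beta=0.5))      # [0.5, 1.0]
print(full_add_compose(v_cost, v_estimate, identity, identity))  # [1.0, 2.0]
print(matrix_compose(v_cost, v_estimate, zeros))                 # [0.0, 0.0]
```

The parameter counts grow from 2 scalars (Add) to $2d^2$ entries (FullAdd and Matrix), which is the sense in which later functions in the list have more capacity.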

Paraphrase-based.

In this approach we follow the literature of paraphrase-based phrase embeddings (e.g. Wieting et al., 2016, 2017). We generate paraphrases for each noun compound, and train the function with the objective of producing similar vectors to the noun compound and its paraphrase.

To obtain the representation of a phrase (either a noun compound or its variable-length paraphrase), we encode it with an LSTM. For a given noun compound NC = w1 w2 and its paraphrase $p$ , we set the loss to:

$$\max(0, \lambda - \cos(v_{NC}, v_{p}) + \cos(v_{NC}, v_{p'}))$$

where $v_{x}=LSTM(x)$ is the encoding of phrase $x$, $p'$ is a negative-sampled paraphrase, and $\lambda$ was set to 0.6 based on its value in Wieting et al. (2016). The following approaches were used to obtain the paraphrases:

• Backtranslation: We translate each noun compound to foreign language(s) and back to English, as in Wieting et al. (2017). Specifically, we use the DeepL Translator web interface (https://www.deepl.com), performing translation from English to 4 different foreign languages (French, Italian, Spanish, and Romanian) and back to English. We focused on Romance languages because they translate English noun compounds to noun phrases with prepositions (Girju, 2007), and we were hoping that this would drive the backtranslation to be more explicit. For example, baby oil is translated in French to huile pour bébé, which literally means oil for baby. In practice, translating back to English mostly generates paraphrases which are other noun compounds (synonyms or related terms), rather than prepositional paraphrases.

We use all the suggested translations to generate a large list of paraphrases for each noun compound, but we apply two filters. First, we trivially remove the noun compound itself from its list of paraphrases. Second, the translation sometimes yields non-English phrases (a result of an error in the translation), which we automatically identify and remove using a language identification tool (https://pypi.org/project/guess_language-spirit/). After filtering out around half of the paraphrases, we are left with an average of 6.71 paraphrases per compound.

• Co-occurrence: We treat the frequent joint occurrences of w1 and w2 in a corpus as paraphrases, e.g. apple cake may yield a paraphrase like “cake made of apples”. Specifically, we use the paraphrases obtained by Shwartz and Dagan (2018) from the Google N-gram corpus (Brants and Franz, 2006). The paraphrases are of variable length (3-5 words), and have been pre-processed to remove punctuation, adjectives, adverbs and determiners. The average number of paraphrases per compound is 9.18.
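
The two filters applied to the backtranslation output can be sketched as below. This is a hedged illustration only: the language check is passed in as a callable so that no particular library API is assumed, `toy_is_english` is a hypothetical stand-in based on a tiny word list, the comparison with the compound is made case-insensitive as an assumption, and the candidates other than baby oil and huile pour bébé are invented for the example rather than real translator output.

```python
from typing import Callable, Iterable, List


def filter_paraphrases(
    compound: str,
    candidates: Iterable[str],
    is_english: Callable[[str], bool],
) -> List[str]:
    """Drop the compound itself and any candidate judged to be non-English."""
    target = compound.strip().lower()
    kept: List[str] = []
    for phrase in candidates:
        if phrase.strip().lower() == target:
            continue  # filter 1: the compound is not its own paraphrase
        if not is_english(phrase):
            continue  # filter 2: translation error left a foreign phrase
        kept.append(phrase)
    return kept


# Hypothetical stand-in for a real language identification tool.
TOY_ENGLISH_WORDS = {"oil", "for", "baby", "babies", "infant"}


def toy_is_english(phrase: str) -> bool:
    return all(word in TOY_ENGLISH_WORDS for word in phrase.lower().split())


candidates = ["baby oil", "oil for babies", "huile pour bébé", "infant oil"]
print(filter_paraphrases("baby oil", candidates, toy_is_english))
# ['oil for babies', 'infant oil']
```

In practice the `is_english` argument would wrap whatever language identifier is available, which keeps the filtering logic independent of that tool.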

2.2 Constituent Word Embeddings

To represent the constituent words, we trained various word embedding algorithms: word2vec (Mikolov et al., 2013) and fastText (Bojanowski et al., 2017), which extends word2vec by adding subword information. We used both the Skip-Gram objective (which predicts the context words given the target word) and the CBOW objective (continuous bag-of-words, predicting the target word from its context), using the Gensim implementation (https://radimrehurek.com/gensim/). We also trained the GloVe algorithm (Pennington et al., 2014), which estimates the log-probability of a word pair co-occurrence. All the embeddings were trained on the English Wikipedia dump from January 2018, with various values for the window size (2, 5, 10) and the embedding dimension (100, 200, 300).
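
The size of the resulting configuration grid can be checked with a short enumeration. The sketch assumes that the 315 DSMs mentioned at the start of Section 2 are the full cross product of seven representations (the row labels of Table 2) and five embedding variants (word2vec and fastText with Skip-Gram and CBOW, plus GloVe) over the listed windows and dimensions. The paper does not spell this out in the text above, but the arithmetic is consistent with it.

```python
from itertools import product

# Assumed lists, read off the tables and Section 2.2; not an official config.
representations = [
    "Distributional", "Add", "FullAdd", "Matrix", "LSTM",
    "Co-occurrence", "Backtranslation",
]
embeddings = [
    "word2vec SG", "word2vec CBOW", "fastText SG", "fastText CBOW", "GloVe",
]
windows = [2, 5, 10]
dimensions = [100, 200, 300]

# Every (embedding, window, dimension) choice, then every representation on top.
embedding_settings = list(product(embeddings, windows, dimensions))
dsms = list(product(representations, embedding_settings))

print(len(embedding_settings), len(dsms))  # 45 315
```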

2.3 Implementation Details

We implemented the models using the AllenNLP library (Gardner et al., 2018), which is based on the PyTorch framework (Paszke et al., 2017). To train the DSMs we used the list of 18,856 compositional noun compounds from Tratz (2011) (omitting 351 noun compounds belonging to the lexicalized, personal_name, and personal_title classes). We only used binary noun compounds, i.e. consisting of exactly two constituent nouns, and we split them into 80% train, 10% test, and 10% validation sets.
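
An 80/10/10 split of this kind can be sketched as follows. The paper does not say how the split was drawn, so the seeded random shuffle, the function name `split_80_10_10`, and the placeholder compounds are all assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple


def split_80_10_10(
    items: Sequence[str], seed: int = 0
) -> Tuple[List[str], List[str], List[str]]:
    """Shuffle with a fixed seed and cut into train, test and validation parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_test = len(shuffled) // 10
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]  # remainder, about 10%
    return train, test, validation


# Placeholder strings standing in for real noun compounds.
placeholder_compounds = [f"modifier{i} head{i}" for i in range(1000)]
train, test, validation = split_80_10_10(placeholder_compounds)
print(len(train), len(test), len(validation))  # 800 100 100
```

Integer arithmetic is used for the cut points so that the three parts always cover the whole list without overlap.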

For the sake of simplicity, for the remainder of the paper we will refer to the training objective and architecture combination as the “representation”, and a trained instance of the representation, with a choice of underlying word embeddings (algorithm, dimension, and window), as a DSM.

3 Experiments

We compare the various representations in 3 experiments: an analysis of the nearest neighbours of each noun compound vector (Section 3.1), an evaluation on property prediction (Section 3.2), and an evaluation on noun compound relation classification (Section 3.3).

3.1 Nearest Neighbour Analysis

Bibliography

Colin Bannard and Chris Callison-Burch. 2005. Paraphrasing with bilingual parallel corpora. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), pages 597–604, Ann Arbor, Michigan. Association for Computational Linguistics.
Regina Barzilay and Kathleen R. McKeown. 2001. Extracting paraphrases from a parallel corpus. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics.
Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.
Gemma Boleda, Marco Baroni, The Nghia Pham, and Louise McNally. 2013. Intensionality was only alleged: On adjective-noun composition in distributional semantics. In Proceedings of the 10th International Conference on Computational Semantics (IWCS 2013) – Long Papers, pages 35–46, Potsdam, Germany. Association for Computational Linguistics.
Thorsten Brants and Alex Franz. 2006. Web 1T 5-gram Version 1.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Minneapolis, Minnesota. Association for Computational Linguistics.
Corina Dima. 2016. On the compositionality and semantic interpretation of English noun compounds. In Proceedings of the 1st Workshop on Representation Learning for NLP, pages 27–39. Association for Computational Linguistics.
Georgiana Dinu, Nghia The Pham, and Marco Baroni. 2013. General estimation and evaluation of compositional distributional semantic models. In Proceedings of the Workshop on Continuous Vector Space Models and their Compositionality, pages 50–58, Sofia, Bulgaria. Association for Computational Linguistics.