Improving Semantic Composition with Offset Inference

Thomas Kober; Julie Weeds; Jeremy Reffin; David Weir

arXiv:1704.06692·cs.CL·April 25, 2017

TL;DR

This paper introduces a novel distributional inference method leveraging type structures in Anchored Packed Trees to address sparsity in semantic models, enhancing their ability to infer plausible co-occurrences.

Contribution

It presents a new inference technique that exploits type information in APTs to improve semantic composition and reduce data sparsity issues.

Findings

01

Improved inference of plausible co-occurrences in semantic models.

02

Enhanced performance of APTs in semantic composition tasks.

03

Reduction of sparsity problems in distributional semantic models.

Abstract

Count-based distributional semantic models suffer from sparsity due to unobserved but plausible co-occurrences in any text collection. This problem is amplified for models like Anchored Packed Trees (APTs), that take the grammatical type of a co-occurrence into account. We therefore introduce a novel form of distributional inference that exploits the rich type structure in APTs and infers missing data by the same mechanism that is used for semantic composition.

Tables (4)

Table 1: Sample of vectorised features for the Apts shown in Figure 1. Offsetting white by amod creates an offset view, $\emph{white}^{\texttt{amod}}$ , representing a noun, and has the consequence of aligning the feature space with clothes.

white	${white}^{amod}$	clothes
:clean	amod:clean	amod:wet
$\bar{amod}$ :shoes	:shoes	:dress
$\bar{amod}$ . $\bar{dobj}$ :wear	$\bar{dobj}$ :wear	$\bar{dobj}$ :wear
$\bar{amod}$ . $\bar{nsubj}$ :earn	$\bar{nsubj}$ :earn	$\bar{nsubj}$ :admit

Table 2: List of the $10$ nearest neighbours of amod, $\overline{\mbox{dobj}}$ and $\overline{\mbox{nsubj}}$ offset representations.

Offset Representation	Nearest Neighbours
${ancient}^{amod}$	civilzation, mythology, tradition, ruin, monument, trackway, tomb, antiquity, folklore, deity
${red}^{amod}$	${blue}^{amod}$ , ${black}^{amod}$ , ${green}^{amod}$ , ${dark}^{amod}$ , onion, pepper, red, tomato, carrot, garlic
${economic}^{amod}$	${political}^{amod}$ , ${societal}^{amod}$ , cohabiting, economy, growth, cohabitant, globalisation, competitiveness, globalization, prosperity
${government}^{\bar{dobj}}$	overthrow, ${party}^{\bar{dobj}}$ , ${authority}^{\bar{dobj}}$ , ${leader}^{\bar{dobj}}$ , ${capital}^{\bar{dobj}}$ , ${force}^{\bar{dobj}}$ , ${state}^{\bar{dobj}}$ , ${official}^{\bar{dobj}}$ , ${minister}^{\bar{dobj}}$ , oust
${problem}^{\bar{dobj}}$	${difficulty}^{\bar{dobj}}$ , solve, coded, ${issue}^{\bar{dobj}}$ , ${injury}^{\bar{dobj}}$ , overcome, ${question}^{\bar{dobj}}$ , think, ${loss}^{\bar{dobj}}$ , relieve
${law}^{\bar{dobj}}$	violate, ${rule}^{\bar{dobj}}$ , enact, repeal, ${principle}^{\bar{dobj}}$ , unmake, enforce, ${policy}^{\bar{dobj}}$ , obey, flout
${researcher}^{\bar{nsubj}}$	${physician}^{\bar{nsubj}}$ , ${writer}^{\bar{nsubj}}$ , theorize, thwart, theorise, hypothesize, surmise, ${student}^{\bar{nsubj}}$ , ${worker}^{\bar{nsubj}}$ , apprehend
${mother}^{\bar{nsubj}}$	${wife}^{\bar{nsubj}}$ , ${father}^{\bar{nsubj}}$ , ${parent}^{\bar{nsubj}}$ , ${woman}^{\bar{nsubj}}$ , re-married, remarry, ${girl}^{\bar{nsubj}}$ , breastfeed, ${family}^{\bar{nsubj}}$ , disown
${law}^{\bar{nsubj}}$	${rule}^{\bar{nsubj}}$ , ${principle}^{\bar{nsubj}}$ , ${policy}^{\bar{nsubj}}$ , criminalize, ${case}^{\bar{nsubj}}$ , ${contract}^{\bar{nsubj}}$ , prohibit, proscribe, enjoin, ${charge}^{\bar{nsubj}}$

Table 3: Comparison of DI algorithms. ‡ denotes statistical significance at $p<0.01$ in comparison to the method without DI, * denotes statistical significance at $p<0.01$ in comparison to standard DI and † denotes statistical significance at $p<0.05$ in comparison to standard DI.

	ML10				ML08
Apt configuration	AN	NN	VO	Avg	VO
None	$0.35$	$0.50$	$0.39$	$0.41$	$0.22$
Standard DI	${0.48}^{‡}$	$0.51$	${0.43}^{‡}$	${0.47}^{‡}$	${0.29}^{‡}$
Offset Inference	${0.49}^{‡}$	0.52	${0.44}^{‡}$	${0.48}^{* ‡}$	${0.31}^{† ‡}$

Table 4: Comparison with existing methods.

Model	ML10 - Average	ML08
Our work	0.48	0.31
Blacoe and Lapata (2012)	$0.44$	-
Hashimoto et al. (2014)	0.48	-
Weir et al. (2016)	$0.43$	$0.26$
Dinu et al. (2013)	-	$0.23 - 0.26$
Erk and Padó (2008)	-	$0.27$

Code & Models

Repositories

tttthomasssss/acl2017 (official)

Full text

Improving Semantic Composition with Offset Inference

Thomas Kober, Julie Weeds, Jeremy Reffin, David Weir

TAG laboratory, Department of Informatics, University of Sussex

Brighton, BN1 9RH, UK

{t.kober, j.e.weeds, j.p.reffin, d.j.weir}@sussex.ac.uk

Abstract

Count-based distributional semantic models suffer from sparsity due to unobserved but plausible co-occurrences in any text collection. This problem is amplified for models like Anchored Packed Trees (Apts), that take the grammatical type of a co-occurrence into account. We therefore introduce a novel form of distributional inference that exploits the rich type structure in Apts and infers missing data by the same mechanism that is used for semantic composition.

1 Introduction

Anchored Packed Trees (Apts) is a recently proposed approach to distributional semantics that takes distributional composition to be a process of lexeme contextualisation (Weir et al., 2016). A lexeme’s meaning, characterised as knowledge concerning co-occurrences involving that lexeme, is represented with a higher-order dependency-typed structure (the Apt) where paths associated with higher-order dependencies connect vertices associated with weighted lexeme multisets. The central innovation in the compositional theory is that the Apt’s type structure enables the precise alignment of the semantic representation of each of the lexemes being composed. Like other count-based distributional spaces, however, it is prone to considerable data sparsity, caused by not observing all plausible co-occurrences in the given data. Recently, Kober et al. (2016) introduced a simple unsupervised algorithm to infer missing co-occurrence information by leveraging the distributional neighbourhood and ease the sparsity effect in count-based models.

In this paper, we generalise distributional inference (DI) in Apts and show how precisely the same mechanism that was introduced to support distributional composition, namely “offsetting” Apt representations, gives rise to a novel form of distributional inference, allowing us to infer co-occurrences from neighbours of these representations. For example, by transforming a representation of white to a representation of “things that can be white”, inference of unobserved, but plausible, co-occurrences can be based on finding near neighbours (which will be nouns) of the “things that can be white” structure. This furthermore exposes an interesting connection between distributional inference and distributional composition. Our method is unsupervised and maintains the intrinsic interpretability of Apts. (Footnote 1: We release our code and data at https://github.com/tttthomasssss/acl2017.)

2 Offset Representations

The basis of how composition is modelled in the Apt framework is the way that the co-occurrences are structured. In characterising the distributional semantics of some lexeme $w$ , rather than just recording a co-occurrence between $w$ and $w^{\prime}$ within some context window, we follow Padó and Lapata (2007) and record the dependency path from $w$ to $w^{\prime}$ . This syntagmatic structure makes it possible to appropriately offset the semantic representations of each of the lexemes being composed in some phrase. For example, many nouns will have distributional features starting with the type amod, which cannot be observed for adjectives or verbs. Thus, when composing the adjective white with the noun clothes, the feature spaces of the two lexemes need to be aligned first. This can be achieved by offsetting one of the constituents, which we will explain in more detail in this section.

We will make use of the following notation throughout this work. A typed distributional feature consists of a path and a lexeme such as in amod:white. Inverse paths are denoted by a horizontal bar above the dependency relation such as in $\overline{\mbox{dobj}}$ :prefer and higher-order paths are separated by a dot such as in $\overline{\mbox{amod}}$ . $\overline{\mbox{compound}}$ :dress.

Offset representations are the central component in the composition process in the Apt framework. Figure 1 shows the Apt representations for the adjective white (left) and the Apt for the noun clothes (right), as might have been observed in a text collection. Each node holds a multiset of lexemes and the anchor of an Apt reflects the current perspective of a lexeme at the given node. An offset representation can be created by shifting the anchor along a given path. For example the lexeme white is at the same node as other adjectives such as black and clean, whereas nouns such as shoes or noise are typically reached via the $\overline{\mbox{amod}}$ edge.

Offsetting in Apts only involves a change in the anchor; the underlying structure remains unchanged. By offsetting the lexeme white by amod, the anchor is shifted along the $\overline{\mbox{amod}}$ edge, which creates a noun view of the adjective white. We denote the offset view of a lexeme for a given path by superscripting the offset path; for example, the amod offset of the adjective white is denoted as $\emph{white}^{\texttt{amod}}$ . The offsetting procedure changes the starting points of the paths, as is visible in Figure 1 between the anchors for white and $\emph{white}^{\texttt{amod}}$ , since paths always begin at the anchor. The red dashed line in Figure 1 reflects that anchor shift. The lexeme $\emph{white}^{\texttt{amod}}$ represents a prototypical “white thing”, that is, a noun that has been modified by the adjective white. We note that all edges in the Apt space are bi-directional, as exemplified by the coloured amod and $\overline{\mbox{amod}}$ edges in the Apt for white; however, for brevity we only show uni-directional edges in Figure 1.

By considering the Apt representations for the lexemes white and clothes in Figure 1, it becomes apparent that lexemes with different parts of speech are located in different areas of the semantic space. If we want to compose the adjective-noun phrase white clothes, we need to offset one of the two constituents to align the feature spaces in order to leverage their distributional commonalities. This can be achieved by either creating a noun offset view of white, by shifting the anchor along the $\overline{\mbox{amod}}$ edge, or by creating an adjective offset representation of clothes by shifting its anchor along amod. In this work we follow Weir et al. (2016) and always offset the dependent in a given relation. Table 1 shows a subset of the features of Figure 1 as would be represented in a vectorised Apt. Vectorising the whole Apt lexicon results in a very high-dimensional and sparse typed distributional space. The features for $\emph{white}^{\texttt{amod}}$ (middle column) highlight the change in feature space caused by offsetting the adjective white. The features of the offset view $\emph{white}^{\texttt{amod}}$ are now aligned with the noun clothes such that the two can be composed. Composition can be performed by selecting either the union or the intersection of the aligned features.
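The alignment shown in Table 1 can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: the `_` prefix marking inverse relations, the encoding of a path as a tuple, the function names `inverse`, `offset` and `compose_intersection`, and all weights are assumptions introduced here. Taking the minimum weight under intersection is likewise only one possible choice.

```python
def inverse(rel):
    """Return the inverse of a dependency relation ('amod' <-> '_amod')."""
    return rel[1:] if rel.startswith("_") else "_" + rel


def offset(features, rel):
    """Shift the anchor of a vectorised APT by one dependency relation.

    `features` maps (path, lexeme) -> weight, where `path` is a tuple of
    relations read from the anchor. Prepending `rel` cancels against a
    leading inverse(rel); otherwise the path grows by one step.
    """
    shifted = {}
    for (path, lexeme), weight in features.items():
        if path and path[0] == inverse(rel):
            new_path = path[1:]
        else:
            new_path = (rel,) + path
        key = (new_path, lexeme)
        shifted[key] = shifted.get(key, 0.0) + weight
    return shifted


def compose_intersection(u, v):
    """Keep only the features shared by two aligned representations."""
    return {f: min(u[f], v[f]) for f in u.keys() & v.keys()}


# Features taken from Table 1; the weights are invented for illustration.
white = {
    ((), "clean"): 2.0,
    (("_amod",), "shoes"): 3.0,
    (("_amod", "_dobj"), "wear"): 2.0,
    (("_amod", "_nsubj"), "earn"): 1.0,
}
clothes = {
    (("amod",), "wet"): 1.0,
    ((), "dress"): 2.0,
    (("_dobj",), "wear"): 3.0,
    (("_nsubj",), "admit"): 1.0,
}

white_amod = offset(white, "amod")  # noun view of "white"
white_clothes = compose_intersection(white_amod, clothes)
```

In this toy example the offset view reproduces the middle column of Table 1, and only the shared `_dobj:wear` feature survives composition by intersection.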

2.1 Qualitative Analysis of Offset Representations

Any offset view of a lexeme is behaviourally identical to a “normal” lexeme. It has an associated part of speech, a distributional representation which locates it in semantic space, and we can find neighbours for it in the same way that we find neighbours for any other lexeme. In this way, a single Apt data structure is able to provide many different views of any given lexeme. These views reflect the different ways in which the lexeme is used. For example, $\emph{law}^{\overline{\mbox{nsubj}}}$ is the $\overline{\mbox{nsubj}}$ offset representation of the noun law. This lexeme is a verb and represents an action carried out by the law. This contrasts with $\emph{law}^{\overline{\mbox{dobj}}}$ , which is the $\overline{\mbox{dobj}}$ offset representation of the noun law. It is also a verb, but it represents actions done to the law. Table 2 lists the $10$ nearest neighbours for a number of lexemes, offset by amod, $\overline{\mbox{dobj}}$ and $\overline{\mbox{nsubj}}$ respectively.

For example, the neighbourhood of the lexeme ancient in Table 2 shows that the offset view for $\emph{ancient}^{\texttt{amod}}$ is a prototypical representation of an “ancient thing”, with neighbours easily associated with the property ancient. Furthermore, Table 2 illustrates that nearest neighbours of offset views are often other offset representations. This means that for example actions carried out by a mother tend to be similar to actions carried out by a father or a parent.

2.2 Offset Inference

Our approach generalises the unsupervised algorithm proposed by Kober et al. (2016), henceforth “standard DI”, as a method for inferring missing knowledge into an Apt representation. Rather than simply inferring potentially plausible, but unobserved co-occurrences from near distributional neighbours, inferences can be made involving offset Apts. For example, the adjective white can be offset so that it represents a noun, namely a prototypical “white thing”. This allows inferring plausible co-occurrences from other “things that can be white”, such as shoes or shirts. Our algorithm therefore reflects the contextualised use of a word. This has the advantage of being able to make flexible and fine-grained distinctions in the inference process. For example, if the noun law is used as a subject, our algorithm allows inferring plausible co-occurrences from “other actions carried out by the law”. This contrasts with the use of law as an object, where offset inference is able to find co-occurrences on the basis of “other actions done to the law”. This is a crucial advantage over the method of Kober et al. (2016), which only supports inference on uncontextualised lexemes.

A sketch of how offset inference for a lexeme $w$ works is shown in Algorithm 1. Our algorithm requires a distributional model $M$ , an Apt representation for the lexeme $w$ for which to perform offset inference, a dependency path $p$ , describing the offset for $w$ , and the number of neighbours $k$ . The offset representation of $w^{\prime}$ is then enriched with the information from its distributional neighbours by some merge function. We note that if the offset path $p$ is the empty path, we would recover the algorithm presented by Kober et al. (2016). Our algorithm is unsupervised, and agnostic to the input distributional model and the neighbour retrieval function.
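Algorithm 1 itself is not reproduced in this text, so the following Python sketch is reconstructed from the prose description only and should not be read as the authors' implementation. The dictionary-based model, cosine similarity as the neighbour retrieval function, addition as the merge function, the restriction to a single-relation offset (with `None` standing for the empty path), and all names and weights are assumptions made for illustration.

```python
import math


def offset(features, rel):
    """Shift the anchor by one relation; rel=None is the empty path."""
    if rel is None:
        return dict(features)
    inv = rel[1:] if rel.startswith("_") else "_" + rel
    shifted = {}
    for (path, lexeme), w in features.items():
        new_path = path[1:] if path and path[0] == inv else (rel,) + path
        key = (new_path, lexeme)
        shifted[key] = shifted.get(key, 0.0) + w
    return shifted


def cosine(u, v):
    """Cosine similarity between two sparse feature dictionaries."""
    dot = sum(w * v[f] for f, w in u.items() if f in v)
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


def offset_inference(model, word, rel, k, merge=lambda a, b: a + b):
    """Enrich the offset view of `word` with its k nearest neighbours."""
    target = offset(model[word], rel)
    sims = {n: cosine(target, vec) for n, vec in model.items() if n != word}
    neighbours = sorted(
        (n for n in sims if sims[n] > 0), key=sims.get, reverse=True
    )[:k]
    enriched = dict(target)
    for n in neighbours:
        for f, w in model[n].items():
            enriched[f] = merge(enriched.get(f, 0.0), w)
    return enriched


# Toy model with invented weights.
model = {
    "white": {((), "clean"): 2.0, (("_amod",), "shoes"): 3.0,
              (("_amod", "_dobj"), "wear"): 2.0},
    "clothes": {(("amod",), "wet"): 1.0, ((), "dress"): 2.0,
                (("_dobj",), "wear"): 3.0},
    "shirt": {(("_dobj",), "wear"): 2.0, (("_dobj",), "iron"): 1.0},
    "run": {(("dobj",), "race"): 1.0},
}

# Infer features for the noun view of "white" from its two nearest nouns.
enriched = offset_inference(model, "white", "amod", k=2)
# With the empty path, the same function behaves like standard DI.
standard = offset_inference(model, "clothes", None, k=1)
```

Here the noun view of white picks up the previously unobserved `_dobj:iron` feature from its neighbour shirt, while the unrelated verb run contributes nothing because its similarity to the offset view is zero.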

Connection to Distributional Composition

An interesting observation is the similarity between distributional inference and distributional composition, as both operations are realised by the same mechanism — an offset followed by inferring plausible co-occurrence counts for a single lexeme in the case of distributional inference, or for a phrase in the case of composition. The merging of co-occurrence dimensions for distributional inference can also be any of the operations commonly used for distributional composition such as pointwise minimum, maximum, addition or multiplication.
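The four merge operations named above can be illustrated on sparse vectors as follows. This is a minimal sketch under stated assumptions: features absent from a vector are treated as having weight zero, weights are non-negative (as with PPMI scores), and the names `MERGE_OPS` and `merge` as well as the example weights are hypothetical.

```python
import operator

# Pointwise operations mentioned in the text, keyed by a short name.
MERGE_OPS = {
    "min": min,
    "max": max,
    "add": operator.add,
    "mul": operator.mul,
}


def merge(u, v, op):
    """Pointwise merge of two sparse vectors; absent features count as 0."""
    fn = MERGE_OPS[op]
    merged = {f: fn(u.get(f, 0.0), v.get(f, 0.0)) for f in u.keys() | v.keys()}
    # Drop zero entries so the result stays sparse.
    return {f: w for f, w in merged.items() if w != 0.0}


u = {"amod:clean": 2.0, "_dobj:wear": 1.0}
v = {"_dobj:wear": 3.0, ":dress": 4.0}

merged_min = merge(u, v, "min")
merged_max = merge(u, v, "max")
merged_add = merge(u, v, "add")
merged_mul = merge(u, v, "mul")
```

Under these assumptions, minimum and multiplication keep only features present in both vectors, which corresponds to an intersection of features, whereas maximum and addition keep every feature present in either vector, which corresponds to a union.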

This relation creates an interesting dynamic between distributional inference and composition when used in a complementary manner as in this work. The former can be used as a process of co-occurrence embellishment, which adds missing information at the risk of introducing some noise. The latter, on the other hand, can be used as a process of co-occurrence filtering, that is, leveraging the enriched representations while also sieving out the previously introduced noise.

3 Experiments

For our experiments we re-implemented the standard DI method of Kober et al. (2016) for a direct comparison. We built an order 2 Apt space on the basis of the concatenation of ukWaC, Wackypedia and the BNC (Baroni et al., 2009), pre-parsed with the Malt parser (Nivre et al., 2006). We PPMI transformed the raw co-occurrence counts prior to composition, using a negative SPPMI shift of $\log 5$ (Levy and Goldberg, 2014b). We also experimented with composing normalised counts and applying the PPMI transformation after composition as done by Weeds et al. (2017), but found composing PPMI scores to work better for this task.
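The weighting step can be sketched as follows, using the shifted PPMI definition of Levy and Goldberg, $\mathrm{SPPMI}(w, c) = \max(\mathrm{PMI}(w, c) - \log k, 0)$ with $k = 5$. This is a minimal sketch: the nested-dictionary count format, the function name `sppmi` and the toy counts are assumptions, and the authors' actual pipeline may differ in details such as smoothing.

```python
import math


def sppmi(counts, shift=5):
    """Shifted PPMI: max(PMI(w, f) - log(shift), 0) for every observed pair.

    `counts` maps word -> {feature: raw co-occurrence count}.
    Pairs whose shifted score is not positive are dropped.
    """
    total = sum(sum(row.values()) for row in counts.values())
    word_totals = {w: sum(row.values()) for w, row in counts.items()}
    feat_totals = {}
    for row in counts.values():
        for f, c in row.items():
            feat_totals[f] = feat_totals.get(f, 0) + c

    weighted = {}
    for w, row in counts.items():
        weighted[w] = {}
        for f, c in row.items():
            pmi = math.log(c * total / (word_totals[w] * feat_totals[f]))
            score = pmi - math.log(shift)
            if score > 0:
                weighted[w][f] = score
    return weighted


# Toy counts, invented for illustration (100 co-occurrence events in total).
counts = {
    "white": {"_amod:shoes": 8, "_amod:thing": 1},
    "other": {"_amod:thing": 50, ":misc": 41},
}
weighted = sppmi(counts)
```

In this example only the strongly associated pair (white, `_amod:shoes`) keeps a positive weight; every other pair falls below the $\log 5$ threshold and is removed, which illustrates how the shift makes an already sparse space sparser.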

We evaluate our offset inference algorithm on two popular short phrase composition benchmarks by Mitchell and Lapata (2008) and Mitchell and Lapata (2010), henceforth ML08 and ML10 respectively. The ML08 dataset consists of 120 distinct verb-object (VO) pairs and the ML10 dataset contains 108 adjective-noun (AN), 108 noun-noun (NN) and 108 verb-object pairs. The goal is to compare a model’s similarity estimates to human-provided judgements. For both tasks, each phrase pair has been rated by multiple human annotators on a scale between 1 and 7, where 7 indicates maximum similarity. Comparison with human judgements is achieved by calculating Spearman’s $\rho$ between the model’s similarity estimates and the scores of each human annotator individually. We performed composition by intersection and tuned the number of neighbours by a grid search over {0, 10, 30, 50, 100, 500, 1000} on the ML10 development set, selecting 10 neighbours for NNs, 100 for ANs and 50 for VOs for both DI algorithms. We calculate statistical significance using the method of Steiger (1980).
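One way to read this evaluation protocol is to correlate the model's score for a phrase pair with every individual annotator rating of that pair, without averaging the ratings first. The sketch below follows that reading, which is an assumption, and uses a small pure-Python Spearman implementation with average ranks for ties. The function names, the data layout and the toy ratings are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho as the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


def evaluate(model_scores, ratings):
    """Correlate model similarities with each individual human rating.

    `model_scores` maps pair id -> similarity estimate;
    `ratings` is a list of (pair id, annotator id, rating on a 1-7 scale).
    """
    predicted = [model_scores[pair] for pair, _, _ in ratings]
    human = [rating for _, _, rating in ratings]
    return spearman(predicted, human)


# Toy data, invented for illustration.
model_scores = {"p1": 0.9, "p2": 0.5, "p3": 0.1}
ratings = [
    ("p1", "a", 7), ("p2", "a", 4), ("p3", "a", 1),
    ("p1", "b", 6), ("p2", "b", 5), ("p3", "b", 2),
]
rho = evaluate(model_scores, ratings)
```

Because each pair is rated by several annotators, the model scores contain ties by construction, which is why the rank computation averages tied ranks.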

Effect of the number of neighbours

Figure 2 shows the effect of the number of neighbours for AN, NN and VO phrases, using offset inference, on the ML10 development set. Interestingly, NN compounds exhibit an early saturation effect, while VOs and ANs require more neighbours for optimal performance. One explanation for the observed behaviour is that up to some threshold, the neighbours being added contribute genuinely missing co-occurrence events, whereas past that threshold distributional inference degrades to generic smoothing that simply compensates for sparsity while overwhelming the representations with implausible co-occurrence information. A similar effect has also been observed by Erk and Padó (2010) in an exemplar-based model.

Results

Table 3 shows that both forms of distributional inference significantly outperform a baseline without DI. On average, offset inference outperforms the method of Kober et al. (2016) by a statistically significant margin on both datasets.

Table 4 shows that offset inference substantially outperforms comparable sparse models by Dinu et al. (2013) on ML08, achieving a new state-of-the-art, and matches the performance of the state-of-the-art neural network model of Hashimoto et al. (2014) on ML10, while being fully interpretable.

4 Related Work

Distributional inference has its roots in the work of Dagan et al. (1993, 1994), who aim to find probability estimates for unseen words in bigrams, and Schütze (1992, 1998) who leverages the distributional neighbourhood through clustering of contexts for word-sense discrimination. Recently Kober et al. (2016) revitalised the idea for compositional distributional semantic models.

Composition with distributional semantic models has become a popular research area in recent years. Simple, yet competitive, methods are based on pointwise vector addition or multiplication (Mitchell and Lapata, 2008, 2010). However, these approaches neglect the structure of the text, defining composition as a commutative operation.

A number of approaches proposed in the literature attempt to overcome this shortcoming by introducing weighted additive variants (Guevara, 2010, 2011; Zanzotto et al., 2010). Another popular strand of work models semantic composition on the basis of ideas arising in formal semantics. Composition in such models is usually implemented as operations on higher-order tensors (Baroni and Zamparelli, 2010; Baroni et al., 2014; Coecke et al., 2011; Grefenstette et al., 2011; Grefenstette and Sadrzadeh, 2011; Grefenstette et al., 2013; Kartsaklis and Sadrzadeh, 2014; Paperno et al., 2014; Tian et al., 2016; Van de Cruys et al., 2013). Another widespread approach to semantic composition is to use neural networks (Bowman et al., 2016; Hashimoto et al., 2014; Hill et al., 2016; Mou et al., 2015; Socher et al., 2012, 2014; Wieting et al., 2015; Yu and Dredze, 2015), or convolutional tree kernels (Croce et al., 2011; Zanzotto and Dell’Arciprete, 2012; Annesi et al., 2014) as composition functions.

The above approaches are applied to untyped distributional vector space models where untyped models contrast with typed models (Baroni and Lenci, 2010) in terms of whether structural information is encoded in the representation as in the models of Erk and Padó (2008); Gamallo and Pereira-Fariña (2017); Levy and Goldberg (2014a); Padó and Lapata (2007); Thater et al. (2010, 2011); Weeds et al. (2014).

Perhaps the most popular approach in the literature to evaluating compositional distributional semantic models is to compare human word and phrase similarity judgements with similarity estimates of composed meaning representations, under the assumption that better distributional representations will perform better at these tasks (Blacoe and Lapata, 2012; Dinu et al., 2013; Erk and Padó, 2008; Hashimoto et al., 2014; Hermann and Blunsom, 2013; Kiela et al., 2014; Turney, 2012).

5 Conclusion

In this paper we have introduced a novel form of distributional inference that generalises the method introduced by Kober et al. (2016). We have shown its effectiveness for semantic composition on two benchmark phrase similarity tasks where we achieved state-of-the-art performance while retaining the interpretability of our model. We have furthermore highlighted an interesting connection between distributional inference and distributional composition.

In future work we aim to apply our novel method to improve modelling selectional preferences, lexical inference, and scale up to longer phrases and full sentences.

Bibliography

1. Paolo Annesi, Danilo Croce, and Roberto Basili. 2014. Semantic compositionality in tree kernels. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management. ACM, New York, NY, USA, CIKM ’14, pages 1029–1038. https://doi.org/10.1145/2661829.2661955
2. Marco Baroni, Raffaella Bernardi, and Roberto Zamparelli. 2014. Frege in space: A program for compositional distributional semantics. Linguistic Issues in Language Technology 9(6):5–110.
3. Marco Baroni, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. 2009. The wacky wide web: a collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation 43(3):209–226. https://doi.org/10.1007/s10579-009-9081-4
4. Marco Baroni and Alessandro Lenci. 2010. Distributional memory: A general framework for corpus-based semantics. Computational Linguistics 36(4):673–721.
5. Marco Baroni and Roberto Zamparelli. 2010. Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Cambridge, MA, pages 1183–1193. http://www.aclweb.org/anthology/D10-1115
6. William Blacoe and Mirella Lapata. 2012. A comparison of vector-based representations for semantic composition. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics, Jeju Island, Korea, pages 546–556. http://www.aclweb.org/anthology/D12-1050
7. Samuel R. Bowman, Jon Gauthier, Abhinav Rastogi, Raghav Gupta, Christopher D. Manning, and Christopher Potts. 2016. A fast unified model for parsing and sentence understanding. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Berlin, Germany, pages 1466–1477. http://www.aclweb.org/anthology/P16-1139
8. Bob Coecke, Mehrnoosh Sadrzadeh, and Stephen Clark. 2011. Mathematical foundations for a compositional distributed model of meaning. Linguistic Analysis 36(1-4):345–384.