Composition of Sentence Embeddings: Lessons from Statistical Relational Learning
Damien Sileo, Tim Van De Cruys, Camille Pradel, Philippe Muller

TL;DR
This paper explores the use of advanced statistical relational learning models for composing sentence embeddings, demonstrating that these models are more expressive and improve performance on relation prediction and sentence representation tasks.
Contribution
It reveals limitations of traditional composition methods in NLP and introduces SRL-based models that enhance expressiveness and accuracy in semantic relation tasks.
Findings
SRL-based compositions outperform simple methods in relation prediction
Advanced models significantly improve state-of-the-art results
Traditional compositions are insufficient for complex NLP tasks
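Table 1: Overview of the relational models discussed in section 3.

| Model | Scoring function | Parameters |
|---|---|---|
| Unstructured | $\lVert e_1 - e_2 \rVert_p$ | - |
| TransE | $\lVert e_1 + w_r - e_2 \rVert_p$ | $w_r \in \mathbb{R}^d$ |
| RESCAL | $e_1^\top W_r e_2$ | $W_r \in \mathbb{R}^{d \times d}$ |
| DistMult | $\langle e_1, w_r, e_2 \rangle$ | $w_r \in \mathbb{R}^d$ |
| ComplEx | $\mathrm{Re}\langle e_1, w_r, \overline{e_2} \rangle$ | $w_r \in \mathbb{C}^d$ |
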
Table 2: Transfer evaluation tasks.

| name | N | task | C | representation(s) used |
|---|---|---|---|---|
| MR | 11k | sentiment (movies) | 2 | $h_1$ |
| SUBJ | 10k | subjectivity/objectivity | 2 | $h_1$ |
| MPQA | 11k | opinion polarity | 2 | $h_1$ |
| TREC | 6k | question-type | 6 | $h_1$ |
| SICK | 10k | NLI | 3 | $f(h_1, h_2)$ |
| MRPC | 4k | paraphrase detection | 2 | $f(h_1, h_2)$ |
| PDTB | 17k | discursive relation | 5 | $f(h_1, h_2)$ |
| STS14 | 4.5k | similarity | - | $\cos(h_1, h_2)$ |

Table 3: Models trained on natural language inference ($T$ = NLI). $T$ is the score on the base task; AVG is the average over the eight transfer tasks.

| m,s | MR | SUBJ | MPQA | TREC | MRPC | PDTB | SICK | STS14 | $T$ | AVG |
|---|---|---|---|---|---|---|---|---|---|---|
| | 81.2 | 92.7 | 90.4 | 89.6 | 76.1 | 46.7 | 86.6 | 69.5 | 84.2 | 79.1 |
| | 81.4 | 92.8 | 90.5 | 89.6 | 75.4 | 46.6 | 86.7 | 69.5 | 84.3 | 79.1 |
| | 81.2 | 92.6 | 90.5 | 89.6 | 76 | 46.5 | 86.6 | 69.5 | 84.2 | 79.1 |
| | 81.1 | 92.7 | 90.5 | 89.7 | 76.5 | 46.4 | 86.5 | 70.0 | 84.8 | 79.2 |
| | 81.3 | 92.6 | 90.6 | 89.2 | 76.2 | 47.2 | 86.5 | 70.0 | 84.6 | 79.2 |
| | 81.2 | 92.7 | 90.4 | 88.5 | 75.8 | 47.3 | 86.8 | 69.8 | 84.2 | 79.1 |

Table 4: Models trained on discourse connective prediction ($T$ = Disc). $T$ is the score on the base task; AVG is the average over the eight transfer tasks.

| m,s | MR | SUBJ | MPQA | TREC | MRPC | PDTB | SICK | STS14 | $T$ | AVG |
|---|---|---|---|---|---|---|---|---|---|---|
| | 80.4 | 92.7 | 90.2 | 89.5 | 74.5 | 47.3 | 83.2 | 57.9 | 35.7 | 77 |
| | 80.4 | 92.9 | 90.2 | 90.2 | 75 | 47.9 | 83.3 | 57.8 | 35.9 | 77.2 |
| | 80.2 | 92.8 | 90.2 | 88.4 | 74.9 | 47.5 | 82.9 | 57.7 | 35.9 | 76.8 |
| | 80.2 | 92.8 | 90.2 | 90.4 | 74.6 | 48.5 | 83.4 | 58.6 | 36.1 | 77.3 |
| | 80.2 | 92.9 | 90.3 | 90.3 | 75.1 | 47.8 | 83.2 | 58.3 | 36.1 | 77.3 |
| | 80.2 | 92.8 | 90.3 | 89.7 | 74.4 | 47.9 | 83.7 | 58.2 | 35.7 | 77.2 |

Table 5: Comparison models from previous work.

| model | MR | SUBJ | MPQA | TREC | MRPC | PDTB | SICK | STS14 | AVG |
|---|---|---|---|---|---|---|---|---|---|
| InferSent | 81.1 | 92.4 | 90.2 | 88.2 | 76.2 | 46.7 | 86.3 | 70 | 78.9 |
| SkipT | 76.5 | 93.6 | 87.1 | 92.2 | 73 | - | 82.3 | 29 | - |
| BoW | 77.2 | 91.2 | 87.9 | 83 | 72.2 | 43.9 | 78.4 | 54.6 | 73.6 |

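Table 6: Results on relational transfer tasks when the alternative composition function is used for transfer, for models trained on $T$ = Disc (first four columns) and $T$ = NLI (last four columns).

| m,s | MRPC (Disc) | PDTB (Disc) | SICK (Disc) | AVG (Disc) | MRPC (NLI) | PDTB (NLI) | SICK (NLI) | AVG (NLI) |
|---|---|---|---|---|---|---|---|---|
| | 74.8 | 48.2 | 83.6 | 68.9 | 76.2 | 47.2 | 86.9 | 70.1 |
| | 74.9 | 49.3 | 83.8 | 69.3 | 75.9 | 47.1 | 86.9 | 70 |
| | 75 | 48.8 | 83.4 | 69.1 | 75.8 | 47 | 87 | 69.9 |
| | 74.9 | 48.7 | 83.6 | 69.1 | 76.2 | 47.8 | 86.8 | 70.3 |
| | 75.2 | 48.6 | 83.5 | 69.1 | 76.2 | 47.6 | 87.3 | 70.4 |
| | 74.6 | 48.9 | 83.9 | 69.1 | 76.2 | 47.8 | 87 | 70.3 |
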
Composition of Sentence Embeddings:
Lessons from Statistical Relational Learning
Damien Sileo
IRIT, University of Toulouse, France
Synapse Développement, Toulouse, France
Tim Van De Cruys
IRIT, CNRS, France
Camille Pradel
Synapse Développement, Toulouse, France
Philippe Muller
IRIT, University of Toulouse, France
Abstract
Various NLP problems – such as the prediction of sentence similarity, entailment, and discourse relations – are all instances of the same general task: the modeling of semantic relations between a pair of textual elements. A popular model for such problems is to embed sentences into fixed size vectors, and use composition functions (e.g. concatenation or sum) of those vectors as features for the prediction. At the same time, composition of embeddings has been a main focus within the field of Statistical Relational Learning (SRL) whose goal is to predict relations between entities (typically from knowledge base triples). In this article, we show that previous work on relation prediction between texts implicitly uses compositions from baseline SRL models. We show that such compositions are not expressive enough for several tasks (e.g. natural language inference). We build on recent SRL models to address textual relational problems, showing that they are more expressive, and can alleviate issues from simpler compositions. The resulting models significantly improve the state of the art in both transferable sentence representation learning and relation prediction.
1 Introduction
Predicting relations between textual units is a widespread task, essential for discourse analysis, dialog systems, information retrieval, or paraphrase detection. Since relation prediction often requires a form of understanding, it can also be used as a proxy to learn transferable sentence representations. Several tasks that are useful to build sentence representations are derived directly from text structure, without human annotation: sentence order prediction (Logeswaran et al., 2016; Jernite et al., 2017), the prediction of previous and subsequent sentences (Kiros et al., 2015; Jernite et al., 2017), or the prediction of explicit discourse markers between sentence pairs (Nie et al., 2017; Jernite et al., 2017). Human-labeled relations between sentences can also be used for that purpose, e.g. inferential relations (Conneau et al., 2017). While most work on sentence similarity estimation, entailment detection, answer selection, or discourse relation prediction seemingly uses task-specific models, they all involve predicting whether a relation $R$ holds between two sentences $s_1$ and $s_2$. This genericity has been noticed in the literature before (Baudiš et al., 2016) and it has been leveraged for the evaluation of sentence embeddings within the SentEval framework (Conneau et al., 2017).
A straightforward way to predict the probability of $(s_1, R, s_2)$ being true is to represent $s_1$ and $s_2$ with $d$-dimensional embeddings $h_1$ and $h_2$, and to compute sentence pair features $f(h_1, h_2)$, where $f$ is a composition function (e.g. concatenation, product, …). A softmax classifier can learn to predict $R$ with those features. This prediction can be seen as a reasoning based on the content of $h_1$ and $h_2$ (Socher et al., 2013).
Our contributions are as follows:
- we review composition functions used in textual relational learning and show that they lack expressiveness (section 2);
- we draw analogies with existing SRL models (section 3) and design new compositions inspired from SRL (section 4);
- we perform extensive experiments to test composition functions and show that some of them can improve the learning of representations and their downstream uses (section 6).
2 Composition functions for relation prediction
We review here popular composition functions used for relation prediction based on sentence embeddings. Ideally, they should simultaneously fulfill the following minimal requirements:
- make use of interactions between representations of sentences to relate;
- allow for the learning of asymmetric relations (e.g. entailment, order);
- be usable with high dimensionalities (the parameters of the composition $f$ and of the classifier should fit in GPU memory).
Additionally, if the main goal is transferable sentence representation learning, compositions should also incentivize gradually changing sentences to lie on a linear manifold, since transfer usually uses linear models. Another goal can be learning of transferable relation representation. Concretely, a sentence encoder and $f$ can be trained on a base task, and $f(h_1, h_2)$ can be used as features for transfer in another task. In that case, the geometry of the sentence embedding space is less relevant, as long as the $f(h_1, h_2)$ space works well for transfer learning. Our evaluation will cover both cases.
A straightforward instantiation of $f$ is concatenation (Hooda & Kosseim, 2017):

$$f_{[,]}(h_1, h_2) = [h_1, h_2] \tag{1}$$
However, interactions between $h_1$ and $h_2$ cannot be modeled with $f_{[,]}$ followed by a softmax regression. Indeed, for softmax weights $\theta$, the logit $\theta^\top f_{[,]}(h_1, h_2)$ can be rewritten as a sum of independent contributions from $h_1$ and $h_2$, namely $\theta_{[0:d]}^\top h_1 + \theta_{[d:2d]}^\top h_2$. Using a multi-layer perceptron before the softmax would solve this issue, but it also harms sentence representation learning (Conneau et al., 2017; Logeswaran & Lee, 2018), possibly because the perceptron allows for accurate predictions even if the sentence embeddings lie in a convoluted space. To promote interactions between $h_1$ and $h_2$, element-wise product has been used in Baudiš et al. (2016):

$$f_{\odot}(h_1, h_2) = h_1 \odot h_2 \tag{2}$$
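The claim that concatenation followed by a linear softmax layer cannot model interactions can be checked numerically. The sketch below uses toy random vectors and an arbitrary embedding size (both are assumptions for illustration, not values from the paper) and shows that the logit splits into a term that depends only on the first sentence and a term that depends only on the second.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                    # toy embedding size (assumption)
h1, h2 = rng.normal(size=d), rng.normal(size=d)
theta = rng.normal(size=2 * d)           # softmax weights for one relation

def f_concat(h1, h2):
    """Concatenation composition (equation 1)."""
    return np.concatenate([h1, h2])

# Logit produced by a linear layer on top of the concatenation.
logit = theta @ f_concat(h1, h2)

# Same value, written as two independent contributions: no h1-h2 interaction.
separate = theta[:d] @ h1 + theta[d:] @ h2

print(np.isclose(logit, separate))
```

Because the two terms never multiply or subtract components of the two embeddings, changing one sentence shifts the logit by an amount that does not depend on the other sentence.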
Absolute difference is another solution for sentence similarity (Mueller & Thyagarajan, 2016), and its element-wise variation may equally be used to compute informative features:

$$f_{-}(h_1, h_2) = |h_1 - h_2| \tag{3}$$
The latter two were combined into a popular instantiation, sometimes referred to as heuristic matching (Tai et al., 2015; Kiros et al., 2015; Mou et al., 2015):

$$f_{\odot-}(h_1, h_2) = [h_1 \odot h_2, |h_1 - h_2|] \tag{4}$$
Although effective for certain similarity tasks, $f_{\odot-}$ is symmetrical, and should be a poor choice for tasks like entailment prediction or prediction of discourse relations. For instance, if $R_e$ denotes entailment and $(s_1, s_2)$ = (“It just rained”, “The ground is wet”), $(s_1, R_e, s_2)$ should hold but not $(s_2, R_e, s_1)$. The $f_{\odot-}$ composition function is nonetheless used to train/evaluate models on entailment (Conneau et al., 2017) or discourse relation prediction (Nie et al., 2017).
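The symmetry problem described above is easy to reproduce. The following minimal sketch implements the element-wise product, the element-wise absolute difference, and their concatenation (heuristic matching), then checks that swapping the two sentences leaves the features unchanged. The function names and the random stand-in embeddings are illustrative assumptions.

```python
import numpy as np

def f_prod(h1, h2):
    """Element-wise product (equation 2)."""
    return h1 * h2

def f_absdiff(h1, h2):
    """Element-wise absolute difference (equation 3)."""
    return np.abs(h1 - h2)

def f_heuristic(h1, h2):
    """Heuristic matching (equation 4): product and absolute difference."""
    return np.concatenate([f_prod(h1, h2), f_absdiff(h1, h2)])

rng = np.random.default_rng(0)
# Random stand-ins for the embeddings of "It just rained" / "The ground is wet".
h_rain, h_wet = rng.normal(size=5), rng.normal(size=5)

forward = f_heuristic(h_rain, h_wet)
backward = f_heuristic(h_wet, h_rain)

# Identical features in both directions: a classifier on top of them cannot
# tell (s1, R, s2) from (s2, R, s1).
print(np.allclose(forward, backward))
```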
Sometimes $[h_1, h_2]$ is concatenated to $f_{\odot-}(h_1, h_2)$ (Ampomah et al., 2016; Conneau et al., 2017). While the resulting composition is asymmetrical, the asymmetrical component involves no interaction as noted previously. We note that this composition is very commonly used. On the SNLI benchmark (nlp.stanford.edu/projects/snli/, as of February 2019), several of the listed sentence-embedding-based models use it, and others use a weaker form (e.g. omitting one of its components).
The outer product $\otimes$ has been used instead for asymmetric multiplicative interaction (Jernite et al., 2017):

$$f_{\otimes}(h_1, h_2) = h_1 \otimes h_2 \tag{5}$$

This formulation is expressive but it forces $\theta$ to have $d^2$ parameters per relation, which is prohibitive when there are many relations and $d$ is high.
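To make the cost of the outer product concrete, here is a small sketch with a toy dimension (an assumption chosen for readability). It shows that the feature vector has $d^2$ entries, that a linear layer on top of it computes a full bilinear form between the two embeddings, and that the features are not symmetric.

```python
import numpy as np

def f_outer(h1, h2):
    """Outer-product composition (equation 5), flattened to a vector."""
    return np.outer(h1, h2).ravel()

d = 16                                   # toy embedding size (assumption)
rng = np.random.default_rng(0)
h1, h2 = rng.normal(size=d), rng.normal(size=d)

features = f_outer(h1, h2)
W_r = rng.normal(size=(d, d))            # one weight per pair of dimensions

# A linear layer on the flattened outer product is the bilinear form h1^T W_r h2.
logit = W_r.ravel() @ features
bilinear = h1 @ W_r @ h2

print(features.size)                     # d * d weights needed per relation
print(np.isclose(logit, bilinear))
print(np.allclose(f_outer(h1, h2), f_outer(h2, h1)))   # order matters
```

The quadratic growth is the practical obstacle: doubling the embedding size quadruples the number of classifier weights per relation.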
The problems outlined above are well known in SRL. Thus, existing compositions (except $f_{\otimes}$) can only model relations superficially for tasks currently used to train state of the art sentence encoders, like NLI or discourse connectives prediction.
3 Statistical Relational Learning models
In this section we introduce the context of statistical relational learning (SRL) and relevant models. Recently, SRL has focused on efficient and expressive relation prediction based on embeddings. A core goal of SRL (Getoor & Taskar, 2007) is to induce whether a relation $R$ holds between two arbitrary entities $e_1, e_2$. As an example, we would like to assign a score to $(e_1, R, e_2)$ = (Paris, located_in, France) that reflects a high probability. In embedding-based SRL models, entities $e_i$ have vector representations in $\mathbb{R}^d$ and a scoring function reflects truth values of relations. The scoring function should allow for relation-dependent reasoning over the latent space of entities. Scoring functions can have relation-specific parameters, which can be interpreted as relation embeddings. Table 1 presents an overview of a number of state of the art relational models. We can distinguish two families of models: subtractive and multiplicative.
The TransE scoring function is motivated by the idea that translations in latent space can model analogical reasoning and hierarchical relationships. Dense word embeddings trained on tasks related to the distributional hypothesis naturally allow for analogical reasoning with translations without explicit supervision (Mikolov et al., 2013). TransE generalizes the older Unstructured model. We call them subtractive models.
The RESCAL, DistMult, and ComplEx scoring functions can be seen as dot product matching between $e_1$ and a relation-specific linear transformation of $e_2$ (Liu et al., 2017). This transformation helps check whether $e_1$ matches with some aspects of $e_2$. RESCAL allows a full linear mapping $W_r$ but has a high complexity, while DistMult is restricted to a component-wise weighting $w_r$. ComplEx has fewer parameters than RESCAL but still allows for the modeling of asymmetrical relations. As shown in Liu et al. (2017), ComplEx boils down to a restriction of RESCAL where $W_r$ is a block diagonal matrix. These blocks are 2-dimensional, antisymmetric and have equal diagonal terms. Using such a form, even and odd indexes of $e$’s dimensions play the roles of real and imaginary numbers respectively. The ComplEx model (Trouillon et al., 2016) and its variations (Lacroix et al., 2018) yield state of the art performance on knowledge base completion on numerous evaluations.
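The five scoring functions can be written in a few lines. The sketch below is an illustration under stated assumptions: it uses the standard textbook forms of these models, toy dimensions, and hypothetical helper names (`interleave`, `complex_as_rescal_matrix`). It also checks the statement that ComplEx is RESCAL restricted to a block diagonal matrix with 2-dimensional blocks, where even and odd indices hold real and imaginary parts.

```python
import numpy as np

def unstructured(e1, e2, p=2):
    """Distance between entities; no relation parameters."""
    return np.linalg.norm(e1 - e2, ord=p)

def transe(e1, w_r, e2, p=2):
    """Distance after translating e1 by the relation vector w_r."""
    return np.linalg.norm(e1 + w_r - e2, ord=p)

def rescal(e1, W_r, e2):
    """Full bilinear form with a d x d relation matrix."""
    return e1 @ W_r @ e2

def distmult(e1, w_r, e2):
    """Bilinear form restricted to a diagonal relation matrix."""
    return np.sum(e1 * w_r * e2)

def complex_score(e1, w_r, e2):
    """ComplEx: real part of the trilinear product with conjugated e2."""
    return np.real(np.sum(e1 * w_r * np.conj(e2)))

def interleave(z):
    """Real view of a complex vector: even indices real, odd imaginary."""
    out = np.empty(2 * z.size)
    out[0::2], out[1::2] = z.real, z.imag
    return out

def complex_as_rescal_matrix(w_r):
    """Block diagonal matrix whose 2x2 blocks are [[a, b], [-b, a]]."""
    k = w_r.size
    W = np.zeros((2 * k, 2 * k))
    for j in range(k):
        a, b = w_r[j].real, w_r[j].imag
        W[2 * j:2 * j + 2, 2 * j:2 * j + 2] = [[a, b], [-b, a]]
    return W

rng = np.random.default_rng(0)
k = 3                                    # toy complex dimension (assumption)
e1 = rng.normal(size=k) + 1j * rng.normal(size=k)
e2 = rng.normal(size=k) + 1j * rng.normal(size=k)
w = rng.normal(size=k) + 1j * rng.normal(size=k)

lhs = complex_score(e1, w, e2)
rhs = rescal(interleave(e1), complex_as_rescal_matrix(w), interleave(e2))
print(np.isclose(lhs, rhs))
```

Each 2x2 block has equal diagonal terms and opposite off-diagonal terms, which is what lets ComplEx score $(e_1, R, e_2)$ and $(e_2, R, e_1)$ differently with only two parameters per pair of dimensions.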
4 Embeddings composition as SRL models
We claim that several existing models (Conneau et al., 2017; Nie et al., 2017; Baudiš et al., 2016) boil down to SRL models where the sentence embeddings ($h_1, h_2$) act as entity embeddings ($e_1, e_2$). This framework is depicted in figure 1. In this article we focus on sentence embeddings, although our framework can straightforwardly be applied to other levels of language granularity (such as words, clauses, or documents).
Some models (Chen et al., 2017b; Seo et al., 2016; Gong et al., 2018; Radford, 2018; Devlin et al., 2018) do not rely on explicit sentence encodings to perform relation prediction. They combine information of input sentences at earlier stages, using conditional encoding or cross-attention. There is however no straightforward way to derive transferable sentence representations in this setting, and so these models are out of the scope of this paper. They sometimes make use of composition functions, so our work could still be relevant to them in some respect.
In this section we will make a link between sentence composition functions and SRL scoring functions, and propose new scoring functions drawing inspiration from SRL.
4.1 Linking composition functions and SRL models
The composition function $f_{\odot}$ from equation 2 followed by a softmax regression yields a score whose analytical form is identical to the DistMult model score described in section 3. Let $\theta_R$ denote the softmax weights for relation $R$. The logit score for the truth of $(s_1, R, s_2)$ is $f_{\odot}(h_1, h_2)^\top \theta_R = (h_1 \odot h_2)^\top \theta_R$, which is equal to the DistMult scoring function $\langle h_1, \theta_R, h_2 \rangle$ if $h_1, h_2$ act as entity embeddings and $\theta_R$ as the relation weight $w_r$.
Similarly, the composition $f_{-}$ from equation 3 followed by a softmax regression can be seen as an element-wise weighted score of Unstructured (both are equal if softmax weights are all unitary).
Thus, $f_{\odot-}$ from equation 4 (with softmax regression) can be seen as a weighted ensemble of Unstructured and DistMult. These two models are respectively outperformed by TransE and ComplEx on knowledge base link prediction by a large margin (Trouillon et al., 2016; Bordes et al., 2013a). We therefore propose to change the Unstructured and DistMult components of $f_{\odot-}$ such that they match their respective state of the art variations in the following sections. We will also show the implications of these refinements.
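The two correspondences in this subsection can be verified directly. In the sketch below, toy random vectors (an assumption) stand in for sentence embeddings and softmax weights; the first check shows that a linear layer on the element-wise product is the DistMult trilinear product, and the second shows that unit weights on the absolute difference give the Unstructured score with an L1 norm.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6                                    # toy embedding size (assumption)
h1, h2 = rng.normal(size=d), rng.normal(size=d)
theta_r = rng.normal(size=d)             # softmax weights for one relation

# Linear layer on the element-wise product (equation 2) ...
logit_prod = theta_r @ (h1 * h2)
# ... is the DistMult trilinear product with theta_r as relation weights.
distmult_score = np.sum(h1 * theta_r * h2)

# Unit weights on the absolute difference (equation 3) ...
logit_diff = np.ones(d) @ np.abs(h1 - h2)
# ... give the L1 distance used by the Unstructured model.
unstructured_l1 = np.linalg.norm(h1 - h2, ord=1)

print(np.isclose(logit_prod, distmult_score))
print(np.isclose(logit_diff, unstructured_l1))
```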
4.2 Casting TransE as a composition
Simply replacing $f_{-}$ with

$$f_{t}(h_1, h_2) = |h_1 - h_2 + t| \tag{6}$$

would make the model analogous to TransE. $t$ is learned and is shared by all relations. A relation-specific translation $t_R$ could be used but it would make $f$ relation-specific. Instead, here, each dimension of $f_{t}(h_1, h_2)$ can be weighted according to a given relation. Non-zero $t$ makes $f_{t}$ asymmetrical and also yields features that allow for the checking of an analogy between $s_1$ and $s_2$. Sentence embeddings often rely on pre-trained word embeddings which have demonstrated strong capabilities for analogical reasoning. Some analogies, such as part-whole, are computable with off-the-shelf word embeddings (Chen et al., 2017a) and should be very informative for natural language inference tasks. As an illustration, let us consider an artificial semantic space (depicted in figures 2(a) and 2(b)) where we posit that there is a “to the past” translation $t^{\ast}$ so that $h + t^{\ast}$ is the embedding of a sentence changed to the past tense. Unstructured is not able to leverage this semantic space to correctly score $(s_1, R_{\text{to the past}}, s_2)$ while TransE is well tailored to provide highest scores for sentences near $h_1 + t$, where $t$ is an estimation of $t^{\ast}$ that could be learned from examples.
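The “to the past” illustration can be mimicked with a two-dimensional toy space. Everything in the sketch below is an assumption made for illustration: the embeddings, the translation vector, and the sign convention $|h_1 - h_2 + t|$ (chosen to match the TransE form $e_1 + w_r - e_2$).

```python
import numpy as np

def f_absdiff(h1, h2):
    """Unstructured-style features: symmetric absolute difference."""
    return np.abs(h1 - h2)

def f_trans(h1, h2, t):
    """TransE-style features: absolute difference after translating h1 by t."""
    return np.abs(h1 - h2 + t)

# Toy space: moving one unit along the first axis means "changed to the past".
t = np.array([1.0, 0.0])
h_present = np.array([0.2, 0.5])         # hypothetical present-tense sentence
h_past = h_present + t                   # its past-tense counterpart

forward = f_trans(h_present, h_past, t)  # (present, to-the-past, past)
reverse = f_trans(h_past, h_present, t)  # (past, to-the-past, present)

print(forward)   # all features near zero: the pair fits the translation
print(reverse)   # large first feature: the reversed pair does not fit
print(f_absdiff(h_present, h_past), f_absdiff(h_past, h_present))  # identical
```

With $t = 0$ the translated features reduce to the symmetric absolute difference, so a model is free to ignore the translation when it does not help.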
4.3 Casting ComplEx as a composition
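Let us partition $h$ dimensions into two equally sized sets $\mathcal{R}$ and $\mathcal{I}$, e.g. even and odd dimension indices of $h$. We propose a new function $f_{\mathbb{C}}$ as a way to fit the ComplEx scoring function into a composition function.

$$f_{\mathbb{C}}(h_1, h_2) = [h_1^{\mathcal{R}} \odot h_2^{\mathcal{R}} + h_1^{\mathcal{I}} \odot h_2^{\mathcal{I}},\; h_1^{\mathcal{R}} \odot h_2^{\mathcal{I}} - h_1^{\mathcal{I}} \odot h_2^{\mathcal{R}}] \tag{7}$$

$f_{\mathbb{C}}(h_1, h_2)$ multiplied by softmax weights $\theta_r$ is equivalent to the ComplEx scoring function $\mathrm{Re}\langle h_1, \theta_r, \overline{h_2} \rangle$. The first half of weights $\theta_r$ corresponds to the real part of ComplEx relation weights while the last half corresponds to the imaginary part.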
$f_{\mathbb{C}}$ is to the ComplEx scoring function what $f_{\odot}$ is to the DistMult scoring function. Intuitively, ComplEx is a minimal way to model interactions between distinct latent dimensions while DistMult only allows for identical dimensions to interact.
Let us consider a new artificial semantic space (shown in figures 2(c) and 2(d)) where the first dimension is high when a sentence means that it just rained, and the second dimension is high when the ground is wet. Over this semantic space, DistMult is only able to detect entailment for paraphrases whereas ComplEx is also able to naturally model that the score of (“it just rained”, $R_e$, “the ground is wet”) should be high while the score of its converse should not.
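The equivalence with ComplEx and the rain/wet-ground illustration can both be checked in a short sketch. The one-hot toy embeddings, the helper names, and the choice of even indices as the “real” set and odd indices as the “imaginary” set are assumptions for illustration.

```python
import numpy as np

def split(h):
    """Even indices play the role of real parts, odd indices of imaginary parts."""
    return h[0::2], h[1::2]

def f_complex(h1, h2):
    """ComplEx-style composition: symmetric part, then antisymmetric part."""
    r1, i1 = split(h1)
    r2, i2 = split(h2)
    return np.concatenate([r1 * r2 + i1 * i2, r1 * i2 - i1 * r2])

# 1) A linear layer on f_complex equals the ComplEx score.
rng = np.random.default_rng(0)
d = 8                                    # toy embedding size (assumption)
h1, h2 = rng.normal(size=d), rng.normal(size=d)
theta = rng.normal(size=d)               # first half: real part, second: imaginary
logit = theta @ f_complex(h1, h2)

z1 = h1[0::2] + 1j * h1[1::2]
z2 = h2[0::2] + 1j * h2[1::2]
w = theta[: d // 2] + 1j * theta[d // 2:]
complex_score = np.real(np.sum(z1 * w * np.conj(z2)))
print(np.isclose(logit, complex_score))

# 2) Toy space: dimension 0 = "it just rained", dimension 1 = "the ground is wet".
h_rain = np.array([1.0, 0.0])
h_wet = np.array([0.0, 1.0])
forward = f_complex(h_rain, h_wet)       # rain -> wet
backward = f_complex(h_wet, h_rain)      # wet -> rain
print(forward, backward)                 # second feature flips sign
print(h_rain * h_wet)                    # element-wise product sees nothing
```

A positive weight on the second feature therefore scores the rain-to-wet direction above its converse, which the element-wise product alone cannot do because the two sentences share no active dimension.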
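We also propose two more general versions of $f_{\mathbb{C}}$:

$$f_{\mathbb{C}^{\alpha}}(h_1, h_2) = [h_1^{\mathcal{R}} \odot h_2^{\mathcal{R}},\; h_1^{\mathcal{I}} \odot h_2^{\mathcal{I}},\; h_1^{\mathcal{R}} \odot h_2^{\mathcal{I}} - h_1^{\mathcal{I}} \odot h_2^{\mathcal{R}}] \tag{8}$$

$$f_{\mathbb{C}^{\beta}}(h_1, h_2) = [h_1^{\mathcal{R}} \odot h_2^{\mathcal{R}},\; h_1^{\mathcal{I}} \odot h_2^{\mathcal{I}},\; h_1^{\mathcal{R}} \odot h_2^{\mathcal{I}},\; h_1^{\mathcal{I}} \odot h_2^{\mathcal{R}}] \tag{9}$$

$f_{\mathbb{C}^{\alpha}}$ can be seen as DistMult concatenated with the asymmetrical part of ComplEx and $f_{\mathbb{C}^{\beta}}$ can be seen as RESCAL with unconstrained block diagonal relation matrices.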
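The relationship between the ComplEx composition and its two more general versions can be sketched as follows. The exact layout of the feature vectors is a reconstruction from the description above (DistMult terms plus the asymmetrical ComplEx part, and all four products of a 2x2 block), so the function names `f_alpha` and `f_beta` and the ordering of the parts are assumptions.

```python
import numpy as np

def split(h):
    """Even indices as the 'real' set, odd indices as the 'imaginary' set."""
    return h[0::2], h[1::2]

def f_complex(h1, h2):
    r1, i1 = split(h1)
    r2, i2 = split(h2)
    return np.concatenate([r1 * r2 + i1 * i2, r1 * i2 - i1 * r2])

def f_alpha(h1, h2):
    """DistMult terms kept separate, plus the antisymmetric ComplEx part."""
    r1, i1 = split(h1)
    r2, i2 = split(h2)
    return np.concatenate([r1 * r2, i1 * i2, r1 * i2 - i1 * r2])

def f_beta(h1, h2):
    """All four products of each 2x2 block, left unconstrained."""
    r1, i1 = split(h1)
    r2, i2 = split(h2)
    return np.concatenate([r1 * r2, i1 * i2, r1 * i2, i1 * r2])

rng = np.random.default_rng(0)
d = 8                                    # toy embedding size (assumption)
k = d // 2
h1, h2 = rng.normal(size=d), rng.normal(size=d)

a, b = f_alpha(h1, h2), f_beta(h1, h2)
print(f_complex(h1, h2).size, a.size, b.size)   # d, 3d/2, 2d features

# Each version is a linear function of the next, more general one.
alpha_from_beta = np.concatenate([b[:k], b[k:2 * k], b[2 * k:3 * k] - b[3 * k:]])
complex_from_alpha = np.concatenate([a[:k] + a[k:2 * k], a[2 * k:]])
print(np.allclose(alpha_from_beta, a))
print(np.allclose(complex_from_alpha, f_complex(h1, h2)))
```

Since each version is a linear function of the next, a softmax regression on the most general one can represent anything the more constrained ones can, at the price of more weights per relation.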
5 On the evaluation of relational models
The SentEval framework (Conneau et al., 2017) provides a general evaluation for transferable sentence representations, with open source evaluation code. One only needs to specify a sentence encoder function, and the framework performs classification tasks or relation prediction tasks using cross-validated logistic regression on embeddings or composed sentence embeddings. Tasks include sentiment analysis, entailment, textual similarity, textual relatedness, and paraphrase detection. These tasks are a rich way to train or evaluate sentence representations since in a triple $(s_1, R, s_2)$, we can see $R$ as a label for $(s_1, s_2)$ (Baudiš et al., 2016). Unfortunately, the relational tasks hard-code the composition function from equation 4. From our previous analysis, we believe this composition function favors the use of contextual/lexical similarity rather than high-level reasoning and can penalize representations based on more semantic aspects. This bias could harm research since semantic representation is an important next step for sentence embedding. Training/evaluation datasets are also arguably flawed with respect to relational aspects since several recent studies (Dasgupta et al., 2018; Poliak et al., 2018; Gururangan et al., 2018; Glockner et al., 2018) show that InferSent, despite being state of the art on SentEval evaluation tasks, has poor performance when dealing with asymmetrical tasks and non-additive composition of words. In addition to providing new ways of training sentence encoders, we will also extend the SentEval evaluation framework with a more expressive composition function when dealing with relational transfer tasks, which improves results even when the sentence encoder was not trained with it.
6 Experiments
Our goal is to show that transferable sentence representation learning and relation prediction tasks can be improved when our expressive compositions are used instead of the composition from equation 4. We train our relational model adaptations on two relation prediction base tasks ($T$), one supervised ($T$ = NLI) and one unsupervised ($T$ = Disc) described below, and evaluate sentence/relation representations on base and transfer tasks using the SentEval framework in order to quantify the generalization capabilities of our models. Since we use minor modifications of InferSent and SentEval, our experiments are easily reproducible.
6.1 Training tasks
The goal of natural language inference ($T$ = NLI) is to predict whether the relation between two sentences (premise and hypothesis) is Entailment, Contradiction or Neutral. We use the combination of the SNLI dataset (Bowman et al., 2015) and the MNLI dataset (Williams et al., 2017). We call AllNLI the resulting dataset of examples. Conneau et al. (2017) claim that NLI data allows universal sentence representation learning. They used the $f_{\odot-}$ composition function with concatenated sentence representations in order to train their InferSent model.
We also train on the prediction of discourse connectives between sentences/clauses ($T$ = Disc). Discourse connectives make discourse relations between sentences explicit. In the sentence “I live in Paris but I’m often elsewhere”, the word “but” highlights that there is a contrast between the two clauses it connects. We use Malmi et al.’s (2017) dataset of selected instances with discourse connectives (e.g. however, for example) with the provided train/dev/test split. This dataset has no other supervision than the list of 20 connectives. Nie et al. (2017) used $f_{\odot-}$ concatenated with the sum of sentence representations to train their model, DisSent, on a similar task and showed that their encoder was general enough to perform well on SentEval tasks. They use a dataset that is, at the time of writing, not publicly available.
6.2 Evaluation tasks
Table 2 provides an overview of different transfer tasks that will be used for evaluation. We added another relation prediction task, the PDTB coarse-grained implicit discourse relation task, to SentEval. This task involves predicting a discursive link between two sentences among Comparison, Contingency, Entity based coherence, Expansion, Temporal. We followed the setup of Pitler et al. (2009), without sampling negative examples in training. MRPC, PDTB and SICK will be tested with two composition functions: besides the SentEval composition $f_{\odot-}$, we will use $f_{\mathbb{C}^{\beta},-}$ for transfer learning evaluation, since it has the most general multiplicative interaction and it does not penalize models that do not learn a translation. For all tasks except STS14, a cross-validated logistic regression is used on the sentence or relation representation. The evaluation of the STS14 task relies on Pearson or Spearman correlation between cosine similarity and the target. We force the composition function to be symmetrical on the MRPC task since paraphrase detection should be invariant to permutation of input sentences.
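For the similarity task, the evaluation described above needs no trained classifier: sentence pairs are scored by cosine similarity and the scores are correlated with the gold ratings. The sketch below is a minimal NumPy version with synthetic embeddings and ratings (both made up for illustration); it is not SentEval's own code, and the simple rank-based Spearman helper assumes there are no ties.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embeddings."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def pearson(x, y):
    """Pearson correlation between two score lists."""
    return float(np.corrcoef(x, y)[0, 1])

def spearman(x, y):
    """Spearman correlation as Pearson on ranks (assumes no tied values)."""
    rank = lambda a: np.argsort(np.argsort(np.asarray(a)))
    return pearson(rank(x), rank(y))

rng = np.random.default_rng(0)
n_pairs, d = 20, 16                      # toy sizes (assumption)
first = rng.normal(size=(n_pairs, d))
# Second sentence = noisy copy of the first, with a different noise level per pair.
noise = np.linspace(0.1, 2.0, n_pairs)[:, None]
second = first + noise * rng.normal(size=(n_pairs, d))
gold = 5.0 - 2.0 * noise.ravel()         # synthetic ratings: less noise, higher rating

predicted = [cosine(u, v) for u, v in zip(first, second)]
print(round(pearson(predicted, gold), 3), round(spearman(predicted, gold), 3))
```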
6.3 Setup
We want to compare the different instances of $f$. We follow the setup of InferSent (Conneau et al., 2017): we learn to encode sentences into $h$ with a bi-directional LSTM using element-wise max pooling over time. The dimension size of $h$ is . Word embeddings are fixed GloVe with 300 dimensions, trained on Common Crawl 840B (https://nlp.stanford.edu/projects/glove/). Optimization is done with SGD and decreasing learning rate until convergence.
The only difference with regard to InferSent is the composition. Sentences are composed with six different compositions for training according to the following template:

$$f_{m,s}(h_1, h_2) = [h_1, h_2, f_m(h_1, h_2), f_s(h_1, h_2)] \tag{10}$$

$f_s$ (subtractive interaction) is in $\{f_{-}, f_{t}\}$, $f_m$ (multiplicative interaction) is in $\{f_{\odot}, f_{\mathbb{C}^{\alpha}}, f_{\mathbb{C}^{\beta}}\}$. We do not consider $f_{\mathbb{C}}$ since it yielded inferior results in our early experiments using NLI and SentEval development sets.
$f_{m,s}(h_1, h_2)$ is fed directly to a softmax regression. Note that InferSent uses a multi-layer perceptron before the softmax, but uses only linear activations, so $f_{\odot,-}$ is analytically equivalent to InferSent when $T$ = NLI.
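A sketch of how the six training compositions could be assembled is given below. It assumes the template concatenates both embeddings, one multiplicative interaction, and one subtractive interaction, and that the three multiplicative options are the element-wise product and the two generalized ComplEx forms; the dictionary keys, function names, and toy dimension are hypothetical.

```python
import numpy as np

def split(h):
    """Even indices as the 'real' set, odd indices as the 'imaginary' set."""
    return h[0::2], h[1::2]

def m_prod(h1, h2):
    return h1 * h2

def m_alpha(h1, h2):
    r1, i1 = split(h1)
    r2, i2 = split(h2)
    return np.concatenate([r1 * r2, i1 * i2, r1 * i2 - i1 * r2])

def m_beta(h1, h2):
    r1, i1 = split(h1)
    r2, i2 = split(h2)
    return np.concatenate([r1 * r2, i1 * i2, r1 * i2, i1 * r2])

def s_absdiff(h1, h2, t):
    return np.abs(h1 - h2)               # ignores the translation

def s_trans(h1, h2, t):
    return np.abs(h1 - h2 + t)           # t would be learned during training

MULTIPLICATIVE = {"prod": m_prod, "alpha": m_alpha, "beta": m_beta}
SUBTRACTIVE = {"absdiff": s_absdiff, "trans": s_trans}

def compose(h1, h2, m, s, t):
    """Template: both embeddings, then multiplicative and subtractive parts."""
    return np.concatenate(
        [h1, h2, MULTIPLICATIVE[m](h1, h2), SUBTRACTIVE[s](h1, h2, t)]
    )

rng = np.random.default_rng(0)
d = 8                                    # toy embedding size (assumption)
h1, h2, t = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)

sizes = {
    (m, s): compose(h1, h2, m, s, t).size
    for m in MULTIPLICATIVE
    for s in SUBTRACTIVE
}
for key, size in sizes.items():
    print(key, size)
```

The feature size, and hence the number of softmax weights per relation, grows from $4d$ to $5d$ across the multiplicative options, which stays linear in $d$ unlike the outer product.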
6.4 Results
Across several runs with different initializations, the standard deviations of the scores do not seem to be negligible. We decided to take these into account when reporting scores, contrary to previous work (Kiros et al., 2015; Conneau et al., 2017): we average the scores of 6 distinct runs for each task and use standard deviations under normality assumption to compute significance. Table 3 shows model scores for $T$ = NLI, while Table 4 shows scores for $T$ = Disc. For comparison, Table 5 shows a number of important models from previous work. Finally, in Table 6, we present results for sentence relation tasks that use an alternative composition function ($f_{\mathbb{C}^{\beta},-}$) instead of the standard composition function used in SentEval.
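One way to turn “average 6 runs and use standard deviations under a normality assumption” into a significance decision is a two-sample z-test on the run means, followed by a Bonferroni correction. The sketch below is only an assumption about the procedure (the exact test is not spelled out in the text), and the per-run scores are made-up numbers, not results from the paper.

```python
import math
import numpy as np

def z_test_pvalue(a, b):
    """Two-sided p-value for a difference in means, assuming normality."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    standard_error = math.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    z = (a.mean() - b.mean()) / standard_error
    return math.erfc(abs(z) / math.sqrt(2))

def bonferroni(p_value, n_comparisons):
    """Multiply the p-value by the number of comparisons, capped at 1."""
    return min(1.0, p_value * n_comparisons)

# Made-up scores of 6 runs each, for illustration only.
baseline_runs = [69.4, 69.5, 69.6, 69.5, 69.3, 69.7]
proposed_runs = [70.0, 69.9, 70.1, 70.0, 70.2, 69.8]

p_raw = z_test_pvalue(baseline_runs, proposed_runs)
p_adjusted = bonferroni(p_raw, n_comparisons=5)   # 5 compositions vs. the baseline
print(p_raw, p_adjusted)
```

With only six runs per model, a t-test would be the more conservative choice; the z-test is used here only because the text mentions a normality assumption.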
For sentence representation learning, the baseline composition $f_{\odot,-}$ already performs rather well, being on par with the InferSent scores of the original paper, as would be expected. However, macro-averaging all accuracies, it is the second worst performing model. All three best performing models use the translation ($s = t$). On relational transfer tasks, training with and using complex for transfer (Table 6) always outperforms the baseline ($f_{\odot,-}$ with $f_{\odot,-}$ composition in Tables 3 and 4). Averaging accuracies of those transfer tasks, this result is significant for both training tasks at level (using Bonferroni correction accounting for the 5 comparisons). On base tasks and the average of non-relational transfer tasks (MR, MPQA, SUBJ, TREC), our proposed compositions are on average slightly better than $f_{\odot,-}$. Representations learned with our proposed compositions can still be compared with simple cosine similarity: all three methods using the translational composition ($s = t$) very significantly outperform the baseline (significant at level with Bonferroni correction) on STS14 for . Thus, we believe has more robust results and could be a better default choice than $f_{\odot,-}$ as composition for representation learning. (Note that our compositions are also beneficial with regard to convergence speed: on average, each of our proposed compositions needed fewer epochs to converge than the baseline $f_{\odot,-}$, for both training tasks.)
Additionally, using $f_{\mathbb{C}^{\beta},-}$ (Table 6) instead of $f_{\odot,-}$ (Tables 3 and 4) for transfer learning in relational transfer tasks (PDTB, MRPC, SICK) yields a significant improvement on average, even when $f_{\odot,-}$ was used for training. Therefore, we believe $f_{\mathbb{C}^{\beta},-}$ is an interesting composition for inference or evaluation of models regardless of how they were trained.
7 Related work
There are numerous interactions between SRL and NLP. We believe that our framework merges two specific lines of work: relation prediction and modeling textual relational tasks.
Some previous NLP work focused on composition functions for relation prediction between text fragments, even though they ignored SRL and only dealt with word units. Word2vec (Mikolov et al., 2013) has sparked a great interest in this task with word analogies in the latent space. Levy & Goldberg (2014) explored different scoring functions between words, notably for analogies. Hypernymy relations were also studied, by Chang et al. (2017) and Fu et al. (2014). Levy et al. (2015) proposed tailored scoring functions. Even the skipgram model (Mikolov et al., 2013) can be formulated as finding relations between context and target words. We did not empirically explore textual relational learning at the word level, but we believe that it would fit in our framework, and could be tested in future studies. Numerous approaches (Chen et al., 2017b; Seo et al., 2016; Gong et al., 2018; Joshi et al., 2018) were proposed to predict inference relations between sentences, but do not explicitly use sentence embeddings. Instead, they encode sentences jointly, possibly with the help of previously cited word compositions; it would therefore also be interesting to try applying our techniques within their framework.
Some modeling aspects of textual relational learning have been formally investigated by Baudiš et al. (2016). They noticed the genericity of relational problems and explored multi-task and transfer learning on relational tasks. Their work is complementary to ours since their framework unifies tasks while ours unifies composition functions. Subsequent approaches use relational tasks for training and evaluation on specific datasets (Conneau et al., 2017; Nie et al., 2017).
8 Conclusion
We have demonstrated that a number of existing models used for textual relational learning rely on composition functions that are already used in Statistical Relational Learning. By taking into account previous insights from SRL, we proposed new composition functions and evaluated them. These composition functions are all simple to implement and we hope that it will become standard to try them on relational problems. Larger scale data might leverage these more expressive compositions, as well as more compositional, asymmetric, and arguably more realistic datasets (Dasgupta et al., 2018; Gururangan et al., 2018). Finally, our compositions can also be helpful to improve interpretability of embeddings, since they can help measure relation prediction asymmetry. Analogies through translations helped interpret word embeddings, and perhaps analyzing our learned translation could help interpret sentence embeddings.
References
- Ampomah et al. (2016) Isaac K E Ampomah, Seong-bae Park, and Sang-jo Lee. A Sentence-to-Sentence Relation Network for Recognizing Textual Entailment. World Academy of Science, Engineering and Technology International Journal of Computer and Information Engineering, 10(12):1955–1958, 2016.
- Baudiš et al. (2016) Petr Baudiš, Jan Pichl, Tomáš Vyskočil, and Jan Šedivý. Sentence Pair Scoring: Towards Unified Framework for Text Comprehension. 2016. URL http://arxiv.org/abs/1603.06127.
- Bordes et al. (2013a) Antoine Bordes, Xavier Glorot, Jason Weston, and Yoshua Bengio. A Semantic Matching Energy Function for Learning with Multi-relational Data. Machine Learning, 2013a. ISSN 0885-6125. doi: 10.1007/s10994-013-5363-6. URL http://arxiv.org/abs/1301.3485.
- Bordes et al. (2013b) Antoine Bordes, Nicolas Usunier, Jason Weston, and Oksana Yakhnenko. Translating Embeddings for Modeling Multi-Relational Data. Advances in NIPS, 26:2787–2795, 2013b. ISSN 10495258. doi: 10.1007/s13398-014-0173-7.2.
- Bowman et al. (2015) Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. A large annotated corpus for learning natural language inference. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17-21 September 2015, (September):632–642, 2015. ISSN 9781941643327.
- Chang et al. (2017) Haw-Shiuan Chang, Zi Yun Wang, Luke Vilnis, and Andrew McCallum. Unsupervised Hypernym Detection by Distributional Inclusion Vector Embedding. 2017. URL http://arxiv.org/abs/1710.00880.
- Chen et al. (2017a) Dawn Chen, Joshua C. Peterson, and Thomas L. Griffiths. Evaluating vector-space models of analogy. CoRR, abs/1705.04416, 2017a.
- Chen et al. (2017b) Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. Enhanced LSTM for natural language inference. In Regina Barzilay and Min-Yen Kan (eds.), ACL (1), pp. 1657–1668. Association for Computational Linguistics, 2017b. ISBN 978-1-945626-75-3. URL http://dblp.uni-trier.de/db/conf/acl/acl2017-1.html#ChenZLWJI17.
