Joint Matrix-Tensor Factorization for Knowledge Base Inference

Prachi Jain; Shikhar Murty; Mausam; Soumen Chakrabarti

arXiv:1706.00637·cs.AI·June 5, 2017

TL;DR

This paper compares matrix and tensor factorization models for knowledge base inference, introduces a joint model, and evaluates their robustness and dataset-specific performance, proposing improvements for handling out-of-vocabulary entity pairs.

Contribution

It provides a comprehensive comparison of MF and TF models, introduces a joint TF+MF model, and extends evaluation protocols to better handle OOV entity pairs.

Findings

01

The joint TF+MF model performs robustly across datasets.

02

Extended evaluation protocol improves handling of out-of-vocabulary entity pairs.

03

The best model achieves strong, consistent results across all tested datasets.

Abstract

While several matrix factorization (MF) and tensor factorization (TF) models have been proposed for knowledge base (KB) inference, they have rarely been compared across various datasets. Is there a single model that performs well across datasets? If not, what characteristics of a dataset determine the performance of MF and TF models? Is there a joint TF+MF model that performs robustly on all datasets? We perform an extensive evaluation to compare popular KB inference models across popular datasets in the literature. In addition to answering the questions above, we remove a limitation in the standard evaluation protocol for MF models, propose an extension to MF models so that they can better handle out-of-vocabulary (OOV) entity pairs, and develop a novel combination of TF and MF models. We also analyze and explain the results based on models and dataset characteristics. Our best model…

Tables

Table 1: Scoring functions for various models. A larger value implies more confidence in the validity of the triple. '$\cdot$' denotes the dot product and '$\bullet$' denotes element-wise multiplication.

Model ($M$)	Scoring function ($\phi^{M}(e_{1}, r, e_{2})$)
TransE	$-\lVert \vec{e_{1}} + \vec{r} - \vec{e_{2}} \rVert_{2}$
F	$\vec{r}^{\top} \cdot \vec{ep}_{12}$
E	$(\vec{e_{1}}^{\top} \cdot \vec{r_{s}}) + (\vec{e_{2}}^{\top} \cdot \vec{r_{o}})$
DistMult	$\vec{r}^{\top} \cdot (\vec{e_{1}} \bullet \vec{e_{2}})$

Table 2: The first four rows compare four models on four datasets using the standard evaluation protocol. The fifth row shows F’s performance using our proposed KBI evaluation protocol. The last two rows report results of two most-frequent sanity-check baselines.

Model	FB15K		FB15K-237		WN18		NYT+FB
Model	MRR	HITS@10	MRR	HITS@10	MRR	HITS@10	MRR	HITS@10
DistMult	44.70	66.26	34.07	52.93	75.91	94.12	62.48	72.17
E	22.38	34.56	30.71	44.84	2.36	4.78	7.81	19.14
TransE	43.11	71.97	1.88	0.01	37.15	84.96	7.98	44.05
F	33.62	60.20	28.01	64.76	82.95	98.84	89.28	97.84
F (KBI eval)	13.35	17.03	0.0	0.0	0.14	0.20	74.34	80.01
MFreq $(e_{2}|r^{*})$	24.91	36.03	33.05	47.60	3.10	5.28	0.90	1.56
MFreq $(e_{2}|e_{1}^{*})$	8.22	15.61	0.01	0.01	0.00	0.00	79.34	94.93

Table 3: Original F with old evaluation protocol vs. F (trained OOV vector) with KBI evaluation protocol. Bold means the gold tuple, and italics means that the entity pair isn’t seen in training. (a) Bill Gates is seen with one $e_{2}$ in training, which is not the gold answer; (b) Tina Fey is seen with two $e_{2}$s, including the gold answer.

$⟨$ Bill Gates, lives in, ? $⟩$	F (old)	F (new)
(Bill Gates, lives in, Seattle)	5.34	5.34
(Bill Gates, lives in, Medina)	0.04	-1.4
(Bill Gates, lives in, New York)	?	-1.4
$⋮, ⋮, ⋮$	?	-1.4
Reciprocal rank	0.5	$\sim$ 0.0

Table 4: Number of distinct entities, number of relations, and entity-pair OOV rate, i.e., the percentage of tuples in the test set whose entity pairs weren’t seen during training.

Dataset	$|\mathcal{E}|$	$|\mathcal{R}|$	ep OOV (%)
FB15K	14,951	1,345	68.70
FB15K-237	14,541	237	100.00
WN18	40,943	18	99.52
NYT+FB	24,528	4,111	0.75

Table 5: Results on F model after explicitly modeling OOV vectors. OOV training outperforms other baselines, especially for NYT+FB. Results on FB15K-237 are not reported, due to its 100% entity-pair OOV rate.

Model	FB15K		WN18		NYT+FB
Model	MRR	HITS@10	MRR	HITS@10	MRR	HITS@10
F (random)	13.35	17.03	0.14	0.20	74.34	80.01
F (average)	18.27	24.62	0.13	0.16	71.65	76.80
F (trained)	17.94	23.82	0.19	0.24	81.51	93.67

Table 6: Change in performance of the DM model when initialized with the corresponding embeddings extracted from DM+F (AS).

Dataset	$Δ$ MRR	$Δ$ HITS@10
FB15K-237	-3.38	-3.71
FB15K	-20.48	-27.22
NYT+FB	-60.94	-69.26
WN18	-19.17	-18.00

Table 7: Performance of joint models. AL = additive loss. AS = additive score. DM+F combined with regularized additive loss (RAL) is the most robust across all datasets.

	Model	FB15K		FB15K-237		WN18		NYT+FB
	Model	MRR	HITS@10	MRR	HITS@10	MRR	HITS@10	MRR	HITS@10
1	F	17.94	23.82	0.0	0.0	0.19	0.24	81.51	93.67
2	DM	44.70	66.26	34.07	52.93	75.91	94.12	62.48	72.17
3	E+F (AS)	26.24	37.35	29.71	44.39	1.60	4.04	82.46	92.21
4	DM+F (AS)	22.41	35.81	19.81	41.95	41.54	73.32	81.48	93.47
5	DM+E+F (AS)	29.89	42.00	33.65	49.26	22.92	39.26	81.41	91.41
6	DM+F (AL)	37.61	59.0	26.77	49.77	73.95	93.22	82.28	95.63
7	DM+F (RAL)	45.81	67.64	33.38	53.24	74.55	93.46	82.28	95.63
8	DM+F (Oracle)	49.42	69.00	34.07	52.93	75.95	94.16	86.06	95.73

Table 8: Performance segregated by OOV and non-OOV test queries on FB15K. DM+F (RAL) matches the best models for both OOV and non-OOV queries.

Model	OOV		Non-OOV
Model	MRR	HITS	MRR	HITS
F	0.01	0	57.33	75.98
DM	36.9	58.07	61.82	84.25
DM+F (AS)	14.69	29.79	39.37	49.04
DM+F (RAL)	38.06	59.54	62.84	85.42

Equations

$$p^{M}(e_{2} \mid r, e_{1}; \theta) = \frac{\exp\left(\phi^{M}(e_{1}, r, e_{2}; \theta)\right)}{\sum_{\langle e_{1}, r, e_{2}^{\prime}\rangle \in Neg(e_{1}, r)} \exp\left(\phi^{M}(e_{1}, r, e_{2}^{\prime}; \theta)\right)}$$

$$L_{ll}^{M}(\mathcal{T}, \theta) = -\left[\sum_{\langle e_{1}, r, e_{2}\rangle \in \mathcal{T}} \log p^{M}(e_{1} \mid r, e_{2}; \theta) + \sum_{\langle e_{1}, r, e_{2}\rangle \in \mathcal{T}} \log p^{M}(e_{2} \mid r, e_{1}; \theta)\right]$$

$$L_{mm}^{M}(\mathcal{T}, \theta) = \sum_{t \in \mathcal{T}} \sum_{t^{\prime} \in Neg(t)} \left[\gamma + \phi^{M}(t^{\prime}) - \phi^{M}(t)\right]_{+}$$

$$E_{2}$$

$$L^{DM+F}(\theta^{DM}, \theta^{F}) = L^{DM}(\theta^{DM}) + L^{F}(\theta^{F}) + \lambda \lVert \theta^{F} \rVert_{2}$$

Taxonomy

Topics: Topic Modeling · Tensor decomposition and applications · Recommender Systems and Techniques

Full text

Joint Matrix-Tensor Factorization for Knowledge Base Inference

Prachi Jain¹, Shikhar Murty¹, Mausam¹, and Soumen Chakrabarti²

¹Indian Institute of Technology Delhi

²Indian Institute of Technology Bombay

First two authors contributed equally to the paper.

Abstract

While several matrix factorization (MF) and tensor factorization (TF) models have been proposed for knowledge base (KB) inference, they have rarely been compared across various datasets. Is there a single model that performs well across datasets? If not, what characteristics of a dataset determine the performance of MF and TF models? Is there a joint TF+MF model that performs robustly on all datasets? We perform an extensive evaluation to compare popular KB inference models across popular datasets in the literature. In addition to answering the questions above, we remove a limitation in the standard evaluation protocol for MF models, propose an extension to MF models so that they can better handle out-of-vocabulary (OOV) entity pairs, and develop a novel combination of TF and MF models. We also analyze and explain the results based on models and dataset characteristics. Our best model is robust, and obtains strong results across all datasets.

1 Introduction

Inference over knowledge bases (KBs) has received significant attention within NLP research in the last decade. Most of the early works on this task focus on adapting probabilistic formalisms such as Markov Logic Networks and Bayesian Logic Programs for inferring new KB facts [Schoenmackers et al., 2008, Niu et al., 2012, Raghavan et al., 2012]. The formalisms require a set of inference rules as input, which can be generated automatically using statistical regularities in KBs [Schoenmackers et al., 2010, Berant et al., 2011, Nakashole et al., 2012, Jain and Mausam, 2016].

Recent research on this task has integrated the two components of rule learning and fact inference into one joint deep learning framework. This eschews explicit representation and learning of inference rules, and instead employs a way to score a (possibly new) KB fact $(e_{1},r,e_{2})$ directly. Various algorithms differ in their scoring functions, which score a KB fact using different model assumptions.

This line of research can be further subdivided into two broad categories: matrix factorization and tensor factorization. In both cases the models learn one or more embeddings of the relation $r$; however, they differ in their treatment of entities $e_{1}$ and $e_{2}$. Tensor factorization (TF) approaches (e.g., the E [Riedel et al., 2013], TransE [Bordes et al., 2013], DistMult [Yang et al., 2015], and Rescal [Nickel et al., 2011] models) learn independent embeddings for $e_{1}$ and $e_{2}$, whereas matrix factorization (MF) methods (e.g., the F model [Riedel et al., 2013]) learn an embedding per entity pair $(e_{1},e_{2})$. Except for one paper making some early progress [Singh et al., 2015], their relative benefits have not been studied in detail.

More importantly, MF and TF have been rarely compared on the same datasets. In particular, three popular KBs are commonly used for TF research (WN18, FB15K, FB15K-237) and one for MF research (NYT+FB, New York Times articles annotated with Freebase entities), but rarely has a model been tested on all four. To the best of our knowledge, no paper reports the performance of E and F models on WN18 or FB15K, TransE on FB15K-237 or NYT+FB, and DistMult on NYT+FB.

Contributions:

We unify several closely related tasks into KB inference (KBI) from a combination of incomplete KBs and text corpus. Our goal is to design inference algorithms that work robustly across diverse input combinations and datasets.

To that end, we first compare E, TransE, F and DistMult (DM) models on all four datasets. The comparison reveals that subtle issues arise in the design of training and evaluation procedures when TF methods are compared against or combined with MF methods. Special care is needed to handle out-of-vocabulary (OOV) entity-pairs during evaluation. Otherwise an MF algorithm may appear to perform better than it really does, as in the case of F’s performance on FB15K-237 [Toutanova et al., 2015].

In response, we present the first unified KBI evaluation protocol that can meaningfully compare MF and DM approaches across several datasets. F’s performance deteriorates using the KBI evaluation protocol. The main reason is an ad hoc handling of OOV entity-pairs by F. We then propose an enhancement of F that explicitly learns OOV entity-pair vectors. This significantly improves F’s performance, but DistMult (DM) remains the most robust solution across all datasets.

Further analysis shows that datasets associated with TF approaches have high OOV-rate in most test folds, naturally resulting in F performing poorly. However, F performs well on the dataset with low OOV rate. Our final contribution is a robust joint algorithm combining DM and F, which is competitive with both models on all datasets, and also outperforms the joint models proposed earlier.

Along with the above results, we contribute open-source implementations (https://github.com/dair-iitd/kbi) of all the methods and testing protocols investigated.

2 Background and Experimental Setup

We propose knowledge base inference (KBI) as a task that unifies several closely related tasks in prior work, particularly, knowledge base completion (KBC), link prediction, and relation extraction (RE). In KBC and link prediction, new tuples are inferred from an incomplete structured KB. In RE, relations are inferred between entities mentioned in an unstructured corpus. It is natural [Toutanova et al., 2015] to unify these paradigms, along with textual tuples from OpenIE [Etzioni et al., 2011].

Specifically, we are given an incomplete KB that consists of a set of entities $\mathcal{E}$ and relations $\mathcal{R}$ . $\mathcal{R}$ may contain only semantic relations, only textual relations or a combination of both, as we want inference to benefit from structural regularities among unnormalized and canonical relations, even if these are not reconciled. The KB also contains $\mathcal{T}$ , a set of known valid tuples $t\in\mathcal{T}$ . A tuple $t=\langle e_{1},r,e_{2}\rangle$ consists of a subject entity $e_{1}\in\mathcal{E}$ , object entity $e_{2}\in\mathcal{E}$ , and relation $r\in\mathcal{R}$ . We use a shorthand $ep_{12}$ to refer to entity pair $(e_{1},e_{2})$ . Our goal is to predict the validity of any new tuple not present in the KB.

Our focus is on the numerous neural models $M$ that learn distributed representations (embedding vectors $\in\mathbb{R}^{d}$) of entities and relations. At a high level, each model defines a way to compute a score for the tuple $\langle e_{1},r,e_{2}\rangle$ based on some factorization. There are two broad categories of factorization models: tensor factorization (TF) and matrix factorization (MF). Both kinds of models learn one (or more) embedding of $r$, denoted by $\vec{r}$. However, they differ in their treatment of entities. TF models learn embeddings for each entity, $\vec{e_{1}}$ and $\vec{e_{2}}$, whereas MF models learn a single embedding for each entity pair, $\vec{ep}_{12}$. (Some models may also learn matrix embeddings instead of vectors [Nickel et al., 2011, Socher et al., 2013]. We don’t study these, as they are typically outperformed by the models implemented in this paper [Yang et al., 2015, Trouillon et al., 2016].)

Models differ primarily in the function $\phi^{M}(e_{1},r,e_{2})$ that combines these embeddings to score a tuple. A higher value of $\phi^{M}$ denotes a model’s higher confidence that the tuple is valid. Table 1 lists the scoring functions used by four popular models, which are the focus of our paper. These are E, F [Riedel et al., 2013], TransE [Bordes et al., 2013], and DistMult (DM) [Yang et al., 2015]. Of these, F is an MF model, since it uses the $\vec{ep}_{12}$ embeddings, while the remaining three are TF models. Note that E learns two embedding vectors, $\vec{r_{s}}$ and $\vec{r_{o}}$, for a relation $r$. DM uses an element-wise multiplication $\bullet$ in its scoring function.
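As an illustration of Table 1, the four scoring functions can be sketched in a few lines of plain Python. This is a minimal sketch and not the released implementation: embeddings are assumed to be plain lists of floats, and the function names (`score_transe`, `score_f`, `score_e`, `score_distmult`) are hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def score_transe(e1, r, e2):
    """TransE: negative L2 norm of (e1 + r - e2)."""
    return -math.sqrt(sum((a + b - c) ** 2 for a, b, c in zip(e1, r, e2)))


def score_f(r, ep12):
    """F (matrix factorization): dot product of r with the entity-pair vector."""
    return dot(r, ep12)


def score_e(e1, r_s, r_o, e2):
    """E: subject compatibility with r_s plus object compatibility with r_o."""
    return dot(e1, r_s) + dot(e2, r_o)


def score_distmult(e1, r, e2):
    """DistMult: r dotted with the element-wise product of e1 and e2."""
    return sum(c * a * b for c, a, b in zip(r, e1, e2))
```

Note that only `score_f` takes a single entity-pair vector; the other three compose the score from separate entity vectors, which is the MF vs. TF distinction drawn above.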

Our choice of these models is guided by the fact that these algorithms either form the basis of several recent papers on KB inference or are popular baselines for comparison studies [Toutanova et al., 2015, Trouillon et al., 2016, Demeester et al., 2016, Rocktäschel et al., 2015, Verga et al., 2016b, Verga et al., 2016a, Singh et al., 2015].

Loss functions:

The models are trained such that tuples observed in the KB have higher scores than unobserved ones. Several loss functions have been proposed; we implement two common ones in this work: log-likelihood based loss and max margin loss. Both loss functions sample a negative set $Neg(e_{1},r)$ for every tuple, drawn from $\{\langle e_{1},r,e_{2}^{\prime}\rangle|e_{2}^{\prime}\in{\cal E}\wedge\langle e_{1},r,e_{2}^{\prime}\rangle\notin\mathcal{T}\}$, i.e., tuples formed by uniformly sampling entities that are not a priori known to be valid. The set $Neg(r,e_{2})$ is sampled similarly.

To define a log-likelihood based loss for $M$, Toutanova et al. [Toutanova et al., 2015] first model an approximate conditional probability (for a rigorous estimate, we would also need to include the numerator in the denominator, and correct the denominator by the ratio of population to sample size):

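$$p^{M}(e_{2} \mid r, e_{1}; \theta) = \frac{\exp\left(\phi^{M}(e_{1}, r, e_{2}; \theta)\right)}{\sum_{\langle e_{1}, r, e_{2}^{\prime}\rangle \in Neg(e_{1}, r)} \exp\left(\phi^{M}(e_{1}, r, e_{2}^{\prime}; \theta)\right)} \tag{1}$$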

Here $\theta$ represents model parameters: the embeddings for each relation and entity (or entity pair). $p^{M}(e_{1}|r,e_{2};\theta)$ is estimated similarly using $Neg(r,e_{2})$ . The log-likelihood loss to minimize is

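$$L_{ll}^{M}(\mathcal{T}, \theta) = -\left[\sum_{\langle e_{1}, r, e_{2}\rangle \in \mathcal{T}} \log p^{M}(e_{1} \mid r, e_{2}; \theta) + \sum_{\langle e_{1}, r, e_{2}\rangle \in \mathcal{T}} \log p^{M}(e_{2} \mid r, e_{1}; \theta)\right] \tag{2}$$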

On the other hand, max-margin loss minimizes a margin-based ranking criterion [Bordes et al., 2013]:

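$$L_{mm}^{M}(\mathcal{T}, \theta) = \sum_{t \in \mathcal{T}} \sum_{t^{\prime} \in Neg(t)} \left[\gamma + \phi^{M}(t^{\prime}) - \phi^{M}(t)\right]_{+} \tag{3}$$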

where $t=\langle e_{1},r,e_{2}\rangle$ , $Neg(t)=Neg(e_{1},r)\cup Neg(r,e_{2})$ , $\gamma$ is the margin and $[x]_{+}=\max\{0,x\}$ .
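The two losses can be sketched over precomputed scores as follows. This is only an illustrative sketch: it assumes the scores $\phi^{M}$ of each positive tuple and of its sampled negatives have already been computed, and the function names and input layout are hypothetical. As in the approximate conditional probability above, the denominator sums over the negatives only.

```python
import math


def log_prob(pos_score, neg_scores):
    """log of exp(pos) / sum_j exp(neg_j); the denominator covers negatives only."""
    m = max(neg_scores)  # subtract the max for numerical stability
    log_denom = m + math.log(sum(math.exp(s - m) for s in neg_scores))
    return pos_score - log_denom


def ll_loss(examples):
    """examples: list of (pos_score, scores over Neg(e1, r), scores over Neg(r, e2))."""
    return -sum(log_prob(p, neg_e2) + log_prob(p, neg_e1)
                for p, neg_e2, neg_e1 in examples)


def mm_loss(examples, gamma=1.0):
    """examples: list of (pos_score, scores over Neg(t)); hinge on the margin gamma."""
    return sum(max(0.0, gamma + n - p) for p, negs in examples for n in negs)
```

For instance, with margin 1, a positive tuple scored 2.0 and negatives scored 0.0 and 1.5 contribute 0 and 0.5 to the max-margin loss, since only the second negative falls within the margin.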

Finally, note that since MF models operate over entity pairs, they do not need two $Neg$ sets. They use one set where new entity pairs $(e_{1}^{\prime},e_{2}^{\prime})$ are sampled such that $\langle e_{1}^{\prime},r,e_{2}^{\prime}\rangle\notin\mathcal{T}$ . These negative entity pairs are sampled only from the entity pairs found in $\mathcal{T}$ , since embeddings for only those pairs get learned.

MF vs. TF Models:

Limited comparisons have been made between the MF and TF families. Toutanova et al. [Toutanova et al., 2015] compare F with some TF models on one dataset and find that F does not perform as well as TF. Singh et al. [Singh et al., 2015] use a series of artificial experiments to conclude that MF models typically perform well on tasks where there is significant relation synonymy in the data, whereas TF models perform better when there are latent types for each relation that need to be predicted. Singh and Toutanova experiment on one real dataset each and show the value of (different) joint MF-TF models on those datasets. We revisit these in Section 5.

2.1 Datasets

Most KB inference systems have used one or more of four popular KBs for evaluation. These include WN18 (eighteen Wordnet relations [Bordes et al., 2013]) and three datasets over Freebase (FB). One dataset is FB15K [Bordes et al., 2013] that has 1,345 relations. Another dataset is FB15K-237, which is a subset of FB15K comprising 237 relations that seldom overlap in terms of entity pairs [Toutanova et al., 2015]. The fourth dataset is NYT+FB, which, along with FB triples, also includes dependency path-based textual relations from New York Times, the mentions of entities in which are aligned with entities in Freebase [Riedel et al., 2013].

Our literature search reveals that no algorithm has been tested on all four datasets. To the best of our knowledge, no paper reports results of E and F models on WN18 or FB15K, TransE on FB15K-237 or NYT+FB, and DistMult on NYT+FB. To better understand the strengths and weaknesses of each model (especially TF vs. MF), we compare all models on all datasets. We also release their open source implementations for further research.

2.2 Standard Evaluation Protocol

Since we wish to run these experiments at scale, we follow one of the common evaluation protocols that can be run completely automatically. This method splits the KB into train ( $\mathcal{T}_{tr}$ ) and test tuples ( $\mathcal{T}_{ts}$ ). The system can access only $\mathcal{T}_{tr}$ during training. For each test tuple, $\langle e_{1}^{*},r^{*},e_{2}^{*}\rangle\in\mathcal{T}_{ts}$ , a query $\langle e_{1}^{*},r^{*},?\rangle$ is issued to the trained model $M$ . The model then ranks all entities $e_{2}\in\mathcal{E}$ by decreasing $\phi^{M}(e_{1}^{*},r^{*},e_{2})$ . A higher rank of $e_{2}^{*}$ in this list suggests a better performance of the model. The metrics used to compare two algorithms are mean reciprocal rank (MRR) and the percentage of $e_{2}^{*}$ s obtained in top 10 results (HITS@10).

The testing procedure is typically run with two modifications. First, it is possible that some of the $e_{2}$ s ranked higher than $e_{2}^{*}$ may form known valid tuples $\langle e_{1}^{*},r^{*},e_{2}\rangle$ — it is unfair to penalize the model for predicting these. The filtered metrics remove the set $\{e_{2}|\langle e_{1}^{*},r^{*},e_{2}\rangle\in\mathcal{T}_{tr}\cup\mathcal{T}_{ts}\}$ from the ranked list [Bordes et al., 2013].
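The filtered ranking and the two metrics can be sketched as below. This is a minimal sketch under stated assumptions: `scores` is a hypothetical dict from candidate entity to model score, ties are broken in favor of the gold entity (only strictly higher scores push it down), and both metrics are reported as percentages, as in Table 2.

```python
def filtered_rank(scores, gold, known_valid):
    """Rank of the gold entity after removing other known-valid answers.

    scores: dict mapping each candidate e2 to its score for <e1*, r*, e2>.
    known_valid: other e2s forming valid tuples in train or test (filtered out).
    """
    gold_score = scores[gold]
    rank = 1
    for entity, score in scores.items():
        if entity == gold or entity in known_valid:
            continue
        if score > gold_score:
            rank += 1
    return rank


def mrr_and_hits(ranks, k=10):
    """Mean reciprocal rank and HITS@k, both as percentages."""
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits = sum(1 for r in ranks if r <= k) / len(ranks)
    return 100.0 * mrr, 100.0 * hits
```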

The second modification applies primarily to MF models. In MF, an embedding is learned only for entity pairs that appear in $\mathcal{T}_{tr}$ . Therefore, it is futile to score every $\langle e_{1}^{*},r^{*},e_{2}\rangle$ over a large range of $e_{2}$ s, for most of which, $\vec{ep}_{12}$ is not even known. Instead, only those $e_{2}$ s in a smaller set

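$$E_{2} = \{e_{2} \mid \exists r: \langle e_{1}^{*}, r, e_{2}\rangle \in \mathcal{T}_{tr} \cup \mathcal{T}_{ts}\} \tag{4}$$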

are considered as candidates for ranking [Toutanova et al., 2015, Verga et al., 2016b]. If entity pair $(e_{1}^{*},e_{2})$ is not trained then a random vector is assumed for $\vec{ep}_{1^{*}2}$ .

3 Comparison under Standard and Unified KBI Evaluation Protocols

3.1 Training Details

We first re-implement all algorithms in a common framework written using Keras/Theano [Chollet, 2015, Theano Development Team, 2016]. We use 100 dimensional vectors for all models. They are trained using mini-batch stochastic gradient descent with AdaGrad on K40 GPUs with a learning rate of 0.5. We pre-compute 200 negative samples per tuple. We set margin $\gamma$ to 1 for max margin loss. Following previous work [Yang et al., 2015] all entity and entity-pair vectors are re-normalized to have a unit norm after each batch update. We use a batch size of 20,000 for training. We train all models for 200 epochs. We use early stopping on validation set (a small subset of training set), to prevent our models from overfitting.

We train each model on each dataset using both log-likelihood (LL) and max-margin (MM) loss functions. We pick the best loss function for every setting. In particular, we find that TransE performs much better with MM loss. LL loss works better or at par in all other models except that MM outperforms LL for DistMult on WN18 dataset.

We follow the train-dev-test splits used in previous experiments for FB15K, WN18, and FB15K-237. The testsets $\mathcal{T}_{ts}$ are 3–10% random samples from $\mathcal{T}$ . For NYT+FB, previous works had experimented on a test fold with only 80 correct tuples [Riedel et al., 2013]. Since such a test set is rather small, and in keeping with our other data sets, we create our own train-test splits by randomly sampling about 2% tuples from $\mathcal{T}$ . Only tuples with FB relations are used in the test set similar to previous experiments on this dataset.

3.2 Preliminary Results

The first four rows of Table 2 report the performance of all the models across the datasets. We observe DistMult (DM) to be the overall winner among tensor factorization models: E has good performance on FB15K-237 and TransE gets good scores on FB15K, but DM emerges as the most robust. For TF models on three datasets (FB15K, FB15K-237, WN18) our experiments are able to replicate (or improve upon) various results reported in prior works [Yang et al., 2015, Bordes et al., 2013, Toutanova et al., 2015]. (Yang et al. [Yang et al., 2015] report a higher MRR for DM on WN18.) Since NYT+FB is a new test split, and F hasn’t been tested on other datasets, those results can’t be directly compared against previous work.

We also find that F outperforms DM on two datasets by wide margins and doesn’t perform as well as DM on the other two. It appears that a qualitative analysis of DM vs. F will shed light on their relative strengths and weaknesses. Our analysis reveals a limitation in the standard evaluation protocol that can inflate F’s performance scores for OOV entity pairs.

3.3 KBI Evaluation Protocol

Recall the second modification from Section 2.2. When ranking possible entities $e_{2}$ using the score $\phi(e_{1}^{*},r^{*},e_{2})$ from MF models, the standard evaluation protocol operates over a subset $E_{2}$, instead of all entities in $\mathcal{E}$. This is because many entity pair embeddings $(e_{1}^{*},e_{2})$ are not even trained in the model, and hence their scores will be meaningless. We call these OOV entity pairs. $E_{2}$ contains all entities for which the entity pair $(e_{1}^{*},e_{2})$ is trained. In addition, every $e_{2}^{*}$ that is a gold entity for some query $\langle e_{1}^{*},r^{*},?\rangle$ in the test set is added to $E_{2}$. If the corresponding entity pairs are not trained, a random vector is assumed for them.

Table 3(a) illustrates an extreme case where the gold entity pair (Bill Gates, Medina) is not seen in training, and only one $e_{2}$ (Seattle) is seen with $e_{1}^{*}$ . Here, the MRR for F model will be computed as 0.5 — a gross overestimation! Implicitly, $(e_{1}^{*},e_{2}^{*})$ is getting ranked higher than all other OOV $(e_{1}^{*},e_{2})$ s, whereas they should all be equal. In other words, the mere presence of $\mathcal{T}_{ts}$ in Eqn (4) leaks information.

Ideally, an evaluation protocol for KBI that is tolerant to OOV entity pairs must place all OOV entities at the same rank and output the average value over all possible rankings for them. In our enhanced protocol, we assume one random OOV entity pair vector $(e_{1}^{*},e_{oov})$, identify all $e_{2}\in\mathcal{E}$ that are OOV, assign them all the same score from the model, and compute aggregate scores based on all possible rankings of such OOV entities. In our example of Table 3(a), the MRR will be computed as the average of $\frac{1}{2},\frac{1}{3},\ldots$, which is a very small number.
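The averaging over all rankings of tied OOV candidates has a simple closed form, sketched below. The sketch assumes the gold entity is OOV and is equally likely to land at any position within the tied block; the function name and the candidate counts in the usage note are hypothetical.

```python
def expected_reciprocal_rank(n_better, n_tied):
    """Expected reciprocal rank of a gold entity tied with other OOV candidates.

    n_better: candidates scored strictly higher than the shared OOV score.
    n_tied:   OOV candidates sharing that score, including the gold entity.
    The gold entity occupies each rank n_better+1 .. n_better+n_tied
    with equal probability, so we average the reciprocal ranks.
    """
    return sum(1.0 / (n_better + i) for i in range(1, n_tied + 1)) / n_tied
```

In the Table 3(a) setting, one trained candidate (Seattle) outscores the OOV block, so `n_better` is 1. If the gold entity were the only member of the block, as under the standard protocol, the reciprocal rank would be 0.5; with, say, 1,000 tied OOV candidates the expected value drops below 0.01.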

We note that most existing MF models have been tested on test splits in which none of the gold entity pairs are OOV (except FB15K-237). Hence, the results reported in most previous papers are not affected by our proposed fix. Even otherwise, if variants of MF models are being compared among themselves, while they may overestimate performance somewhat, the relative ordering of various models may not be affected. On the other hand, OOVs become a central issue when MF models are compared against or combined with TF models, since realistic levels of sparsity are very different for the two model families. We elaborate on this below.

3.4 Results Adjusted for KBI Evaluation

When the KBI evaluation protocol is used, F’s performance on all datasets drops drastically, to the extent that its performance is practically zero on two datasets, and extremely weak on the third. However, it continues to have the best numbers for NYT+FB. Our evaluation sanitizes the published numbers for F on FB15K-237 [Toutanova et al., 2015].

Why is there such a significant drop in F’s scores? The answer lies in the entity pair OOV rates for these datasets, i.e., the percentage of tuples in the test set whose entity pairs were not seen during training. Table 4 reports some statistics about the datasets as well as their test sets. We notice that FB15K, FB15K-237 and WN18 all have very high OOV rates, which is strongly correlated with the poor performance of F. On the other hand, NYT+FB has a tiny OOV rate and F performs well on it.
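The entity pair OOV rate of Table 4 follows directly from its definition. A minimal sketch, assuming tuples are stored as `(e1, r, e2)` triples (the function name is hypothetical):

```python
def ep_oov_rate(train, test):
    """Percentage of test tuples whose (e1, e2) pair never occurs in training."""
    seen_pairs = {(e1, e2) for e1, _, e2 in train}
    oov = sum(1 for e1, _, e2 in test if (e1, e2) not in seen_pairs)
    return 100.0 * oov / len(test)
```

A pair counts as seen if it occurs in training with any relation, so a test tuple can be in-vocabulary for F even when its exact triple is new.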

Indeed, it is obvious that if the gold entity pair is not even seen while training, an MF model won’t be able to predict it, since it learns each entity-pair vector separately. On the other hand, a TF model, by virtue of learning each entity vector separately (single entity OOVs are very infrequent in these datasets), could combine its knowledge of each individual entity for predicting unseen entity-pairs. Singh et al. [Singh et al., 2015] contribute some theoretical differences between MF and TF models (see Section 2). Our analysis on the basis of entity-pair OOVs adds to that understanding. Moreover, we believe that OOVs, and more generally, data sparsity, offer a more practical insight into differences between two model types — representation in MF necessitates more data points per entity pair, whereas TF is more robust to sparse datasets.

Why does DM model perform the best? While we do not have a conclusive answer to this question, we believe that two reasons could act in DM’s favor. First, like F, DM also has a representation of an entity pair. However, rather than associating an opaque single vector with each entity pair (where the role of individual entities cannot be identified), DM composes the entity-pair vector using entity vectors, as $\vec{e}_{1}\bullet\vec{e}_{2}$ . Thus, it is likely able to exploit some power of matrix factorization, while still being robust to data sparsity. Secondly, even TransE can be seen as composing an entity-pair vector ( $\vec{e_{1}}-\vec{e_{2}}$ ), but it is additive, whereas DM is multiplicative. Previous work on word vectors has shown that multiplicative scores often outperform additive ones as they amplify smaller differences and reduce larger ones [Levy and Goldberg, 2014, Stanovsky et al., 2015].

3.5 Most-Frequent Baselines

To improve our understanding of the difficulty of each dataset and the quality of each model, we introduce two baselines for our task. Given a query $\langle e_{1}^{*},r^{*},?\rangle$, our first baseline ranks all entities based on the frequency of their occurrence with relation $r^{*}$, i.e., it orders each entity $e_{2}$ based on the cardinality of the set $\{t|t=\langle e_{1},r^{*},e_{2}\rangle\wedge t\in\mathcal{T}_{tr}\}$. A similar baseline orders each entity $e_{2}$ based on its frequency of occurrence with $e_{1}^{*}$, i.e., based on the cardinality of the set $\{t|t=\langle e_{1}^{*},r,e_{2}\rangle\wedge t\in\mathcal{T}_{tr}\}$. We name these baselines MFreq $(e_{2}|r^{*})$ and MFreq $(e_{2}|e_{1}^{*})$ respectively. Our motivation for introducing these is to check whether existing models are able to learn beyond such simple baselines.
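Both baselines amount to counting over the training tuples. The sketch below assumes `(e1, r, e2)` triples and returns only entities with a nonzero count; how entities with zero count are ordered is not specified here and is left out. Function names are hypothetical.

```python
from collections import Counter


def mfreq_given_relation(train, r_star):
    """MFreq(e2 | r*): rank e2 by how often it occurs with relation r* in training."""
    counts = Counter(e2 for _, r, e2 in train if r == r_star)
    return [entity for entity, _ in counts.most_common()]


def mfreq_given_subject(train, e1_star):
    """MFreq(e2 | e1*): rank e2 by how often it occurs with subject e1* in training."""
    counts = Counter(e2 for e1, _, e2 in train if e1 == e1_star)
    return [entity for entity, _ in counts.most_common()]
```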

The last two rows of Table 2 report the performance of these baselines. It is satisfying to see that for FB15K and WN18 datasets, DM outperforms the baselines by large margins. However, for FB15K-237, DM is only marginally better than MFreq $(e_{2}|r^{*})$ . A closer analysis reveals that this dataset is constructed so that there is minimal entity-pair overlap between relations. Thus, how would any model predict the best $e_{2}$ for a query $\langle e_{1}^{*},r^{*},?\rangle$ ? If entity pairs haven’t been repeated much, a natural approach may just find the most frequent entities seen with the relation and order based on frequency. We checked some high MRR predictions made by DM and found that often questions like, what is the language of a specific website were answered correctly as English. This is likely not because DM figured out the language of each website, but because English was the most frequent one.

We also observe that E’s performance remains broadly similar to the performance of MFreq $(e_{2}|r^{*})$ . We attribute this to E’s scoring function, since given $e_{1}^{*}$ and $r^{*}$ , the only term relevant for ranking $e_{2}$ s is $\vec{e_{2}}^{\top}\cdot\vec{r}_{o}$ , i.e., the model looks for compatibility with $r^{*}$ and ignores $e_{1}^{*}$ completely.

Finally, for NYT+FB, MFreq $(e_{2}|e_{1}^{*})$ beats F model significantly suggesting that while F is the best model on that dataset, it is not good enough. We explore this further in the next section.

4 OOV Training for KB Inference

The previous section highlights the importance of OOV entity-pairs in the performance of MF models. In general, a robust model must gracefully handle unseen entities/entity-pairs. A natural extension is to explicitly model an OOV entity-pair vector for the F model (and an OOV entity vector for TF models). In particular, we introduce a vector $(e_{oov},e_{oov})$ for F and $e_{oov}$ for TF. (We also tried learning several entity pair OOV vectors of the form $(e_{1},e_{oov})$, but that didn’t give us better performance.) This modification means that all OOV entity-pairs will have the same score.

OOV vectors can be trained in many ways. We develop two baselines that don’t train the vectors explicitly. One baseline assigns a random value to $(e_{oov},e_{oov})$ . Another is an average baseline that computes $(e_{oov},e_{oov})$ as the average of the vectors of all $(e_{1},e_{2})$ pairs that occur only once in training.
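The average baseline can be sketched as below. This is an illustrative reading of the description above, with hypothetical names; `pair_vectors` stands in for the entity-pair embeddings learned by F.

```python
from collections import Counter

def average_oov_pair_vector(train_triples, pair_vectors):
    """Average the vectors of entity pairs that occur exactly once in training."""
    pair_counts = Counter((e1, e2) for e1, _r, e2 in train_triples)
    singles = [pair_vectors[p] for p, c in pair_counts.items() if c == 1]
    if not singles:
        raise ValueError("no entity pair occurs exactly once in training")
    dim = len(singles[0])
    return [sum(vec[i] for vec in singles) / len(singles) for i in range(dim)]

# Toy data (hypothetical): ("a", "b") occurs twice, the other pairs once.
train = [
    ("a", "r1", "b"),
    ("a", "r2", "b"),
    ("c", "r1", "d"),
    ("e", "r1", "f"),
]
pair_vectors = {
    ("a", "b"): [10.0, 10.0],
    ("c", "d"): [1.0, 3.0],
    ("e", "f"): [3.0, 5.0],
}
oov_vector = average_oov_pair_vector(train, pair_vectors)
print(oov_vector)
```

The repeated pair is excluded, so only the two singleton pairs contribute to the average.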

We also propose a procedure to train the OOV vectors. The high-level motivation is that we wish to score a known tuple higher than a tuple with an OOV. To ensure this, we add $(e_{oov},e_{oov})$ to the $Neg$ set of each training tuple. This encourages the model to learn embeddings such that $\phi^{F}(e_{1},r,e_{2})>\phi^{F}(e_{oov},r,e_{oov})$ . Thus, we ensure that the performance of F is maintained when the gold entity pair is seen in training. Table 3(b) illustrates an example where the correct answer (New York) is seen with Tina Fey and OOV training doesn’t displace its position. For a TF model, we follow an analogous procedure to train an OOV vector $\vec{e}_{oov}$ .
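The effect of adding the OOV pair to the negative set can be sketched with a max-margin loss. Several pieces here are assumptions made for illustration: that $\phi^{F}$ is a dot product between an entity-pair vector and a relation vector, the hinge form of the loss, the margin of 1, and all names and values.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def score_f(pair_vec, rel_vec):
    """Assumed F score: dot product of pair and relation vectors."""
    return dot(pair_vec, rel_vec)

def margin_loss(pos_score, neg_scores, margin=1.0):
    """Hinge loss summed over all negatives of one training tuple."""
    return sum(max(0.0, margin - pos_score + s) for s in neg_scores)

def tuple_loss(pair_vec, rel_vec, sampled_negs, oov_pair_vec=None):
    """Loss for one tuple; the OOV pair is appended to Neg when provided."""
    negs = list(sampled_negs)
    if oov_pair_vec is not None:
        negs.append(oov_pair_vec)
    pos = score_f(pair_vec, rel_vec)
    return margin_loss(pos, [score_f(n, rel_vec) for n in negs])

# Toy vectors (hypothetical)
pair, rel = [1.0, 0.0], [1.0, 0.0]
sampled = [[0.0, 1.0]]
oov = [0.5, 0.0]

loss_without_oov = tuple_loss(pair, rel, sampled)
loss_with_oov = tuple_loss(pair, rel, sampled, oov_pair_vec=oov)
print(loss_without_oov, loss_with_oov)
```

In this toy case the sampled negative already satisfies the margin, so the only remaining loss comes from the OOV pair, which pushes its score below that of the known tuple.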

**Results:** Since the fractions of OOV entities ( $e_{2}^{*}$ s) in the test sets are rather small, OOV training doesn’t benefit TF models much. However, it substantially improves F’s performance. Table 5 compares trained OOV embeddings to the two baselines for F. We find that training the OOV vectors performs better than (or on par with) the averaging baseline overall. F’s score improves tremendously on NYT+FB, to the extent that it beats the MFreq $(e_{2}|e_{1}^{*})$ baseline by a small margin. We conclude that OOV training is essential for realizing the full potential of MF models.

5 Joint MF-TF Models

Background on Joint MF-TF Models: Recall that Singh et al. [Singh et al., 2015] compare TF and MF models (particularly, E and F) and find that they have complementary strengths. In response, they develop joint TF-MF models and find that they outperform individual models on artificial datasets and NYT+FB. Their best model (E+F) uses the scoring function $\phi^{E+F}=\sigma(\phi^{E}+\phi^{F})$ , where $\sigma$ is the sigmoid function. We call this model an additive score (AS) joint model, since the scores of the two models are added. Earlier work by Riedel et al. [Riedel et al., 2013] also experiments with a joint model for NYT+FB. Later, Toutanova et al. [Toutanova et al., 2015] implement a joint E+DM+F model and test it on FB15K-237 but on no other datasets.

We are motivated by developing a model that is robust across all datasets. Do additive score E+F or additive score E+DM+F meet this requirement?

Additive loss (AL) joint model: Our goal is to develop one joint model that can at least match the performance of the best individual model for each dataset. We focus on joint DM+F models.

Preliminary investigations reveal that additive score models can suffer a substantial loss in performance on some datasets. Table 6 shows the drop in performance of the DM component when it is trained jointly in an additive score DM+F model. It clearly shows that DM’s performance can drop drastically due to joint training. A primary reason is that F scores overshadow DM (and E) scores. [Footnote 6: To calibrate them, we tried standardizing scores obtained from pre-trained models. We also tried to learn a slope and bias to push DM and F model scores to the same range simultaneously. We also tried sharing relation parameters to allow information to flow from DM to MF. Unfortunately, none of the approaches were robust across datasets.] Moreover, the parameters in MF models (vectors for entity pairs) significantly outnumber those in TF models (vectors for entities). This can lead to significant overfitting.

In response, we develop a different class of joint models in which, instead of adding the scores ( $\phi$ s), we add their loss functions: $\mathcal{L}^{DM+F}=\mathcal{L}^{DM}+\mathcal{L}^{F}$ . We name these additive loss (AL) joint models. We expect this to be more resilient to overshadowing, since the joint loss expects each model’s individual loss to decrease as much as possible. One may note that the AL style of training is equivalent to training the models separately. However, joint training makes other extensions possible, such as regularization.
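One way to see the overshadowing effect is to compare the training signal that reaches DM under the two schemes. The sketch below assumes a log-likelihood loss on a single positive tuple and an additive score of the form $\sigma(\phi^{DM}+\phi^{F})$ , by analogy with the E+F scoring function; the function names and score values are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def grad_dm_additive_score(phi_dm, phi_f):
    """d/d(phi_dm) of -log(sigmoid(phi_dm + phi_f)) for a positive tuple."""
    return -(1.0 - sigmoid(phi_dm + phi_f))

def grad_dm_additive_loss(phi_dm, phi_f):
    """d/d(phi_dm) of -log(sigmoid(phi_dm)) - log(sigmoid(phi_f)).

    The F term does not depend on phi_dm, so phi_f has no influence here.
    """
    return -(1.0 - sigmoid(phi_dm))

# Toy scores: F is already very confident, DM is uninformed.
phi_dm, phi_f = 0.0, 8.0
grad_as = grad_dm_additive_score(phi_dm, phi_f)
grad_al = grad_dm_additive_loss(phi_dm, phi_f)
print(grad_as, grad_al)
```

Under the additive score, a confident F leaves DM with almost no gradient, whereas under the additive loss DM receives the same signal it would get if trained alone.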

Regularized additive loss (RAL):

We extend the vanilla AL joint model to a regularized joint model in which the parameters of the MF model are L2-regularized. We expect this regularization to reduce the overfitting caused by the large number of MF parameters. Overall, our final joint model has the loss function:

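$\mathcal{L}^{RAL}=\mathcal{L}^{DM}+\mathcal{L}^{F}+\lambda\lVert\theta^{F}\rVert_{2}^{2}$ , where $\theta^{F}$ denotes the parameters of the F model and $\lambda$ is the regularization penalty.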

At test time, for a query $\langle e_{1}^{*},r^{*},?\rangle$ an AL model cannot simply add the scores, since some entity-pairs may be OOVs. We develop various backoff cases, reminiscent of traditional backoff in language models [Manning and Schütze, 2001]. For every $e_{2}$ :

• Case 1: $(e_{1}^{*},e_{2})\in\mathcal{T}_{tr}$ . Score of tuple is $\phi^{DM}(e_{1}^{*},r^{*},e_{2})+\phi^{F}(e_{1}^{*},r^{*},e_{2})$ .

• Case 2: $(e_{1}^{*},e_{2})\notin\mathcal{T}_{tr}$ , but $e_{2}$ is seen in training. Score of tuple is $\phi^{DM}(e_{1}^{*},r^{*},e_{2})+\phi^{F}(e_{oov},r^{*},e_{oov})$ .

• Case 3: $e_{2}$ is not seen in training. Score of tuple is $\phi^{DM}(e_{1}^{*},r^{*},e_{oov})+\phi^{F}(e_{oov},r^{*},e_{oov})$ .
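The three backoff cases can be sketched as a single scoring function. The scorers `phi_dm` and `phi_f` below are toy lookup tables standing in for trained models, and the `OOV` identifier, relation name, and score values are hypothetical.

```python
OOV = "<OOV>"  # hypothetical placeholder id for the OOV entity

def backoff_score(e1, r, e2, phi_dm, phi_f, train_pairs, train_entities):
    """Score candidate e2 for the query <e1, r, ?> using the three cases."""
    if (e1, e2) in train_pairs:
        # Case 1: the entity pair was seen in training.
        return phi_dm(e1, r, e2) + phi_f(e1, r, e2)
    if e2 in train_entities:
        # Case 2: unseen pair, but e2 itself was seen in training.
        return phi_dm(e1, r, e2) + phi_f(OOV, r, OOV)
    # Case 3: e2 was never seen in training.
    return phi_dm(e1, r, OOV) + phi_f(OOV, r, OOV)

# Toy stand-ins for trained scorers (hypothetical values)
dm_scores = {
    ("tina_fey", "lives_in", "new_york"): 2.0,
    ("tina_fey", "lives_in", "chicago"): 0.5,
    ("tina_fey", "lives_in", OOV): 0.25,
}
f_scores = {
    ("tina_fey", "lives_in", "new_york"): 3.0,
    (OOV, "lives_in", OOV): -1.0,
}

def phi_dm(e1, r, e2):
    return dm_scores[(e1, r, e2)]

def phi_f(e1, r, e2):
    return f_scores[(e1, r, e2)]

train_pairs = {("tina_fey", "new_york")}
train_entities = {"tina_fey", "new_york", "chicago"}

args = (phi_dm, phi_f, train_pairs, train_entities)
case1 = backoff_score("tina_fey", "lives_in", "new_york", *args)
case2 = backoff_score("tina_fey", "lives_in", "chicago", *args)
case3 = backoff_score("tina_fey", "lives_in", "atlantis", *args)
print(case1, case2, case3)
```

Every candidate receives both a DM term and an F term, so scores remain comparable across the three cases.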

Results: Table 7 compares the performance of individual models with joint models. Regularization penalty $\lambda$ is chosen over a small devset from within the training set. All joint models are trained using both max-margin and log-likelihood losses, and we report the better of the two.

We find that different additive score models (rows 3–5) perform well on some datasets, but are not robust across them. For example, in FB15K none of these are able to match up to DM’s performance. We attribute this to overfitting by F, which makes the model believe that $\phi^{F}$ is predicting the tuple very well. This lets F override TF and reduces the joint model’s need to learn the best TF model(s). Note that row 3 and row 5 are the models reported in [Singh et al., 2015] and [Toutanova et al., 2015], respectively.

Rows 6 and 7 report the results of additive loss DM+F models, without and with regularization respectively. As anticipated, adding the losses improves performance since both models get trained well. Moreover, regularization also helps considerably since the model is no longer overwhelmed by too many F parameters. The RAL version of DM+F achieves scores close to the best individual model on each dataset. In some cases, its performance is marginally weaker, and in other cases it is slightly better. Overall, this model has the desired robustness across datasets.

**Analysis:** Row 8 of Table 7 also shows the accuracy of an oracle model that, for every test query, post-facto selects the model with the more accurate score (between DM and F). This upper bounds the performance expected from a perfect joint DM+F model, fixing the constituents. We find that the oracle is only 3–4 MRR percentage points better than our best model for two datasets, and the differences are much smaller for the other two. Overall, this suggests that our proposed joint model obtains strong, robust performance.
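The oracle can be sketched as a per-query choice between the two models. Here "more accurate" is interpreted as the better (lower) rank of the gold answer, which is an assumption; the rank lists are toy values.

```python
def mean_reciprocal_rank(ranks):
    """MRR over a list of 1-based ranks of the gold answers."""
    return sum(1.0 / rank for rank in ranks) / len(ranks)

def oracle_ranks(ranks_a, ranks_b):
    """For each query, keep the better rank of the two models."""
    return [min(a, b) for a, b in zip(ranks_a, ranks_b)]

# Toy gold-answer ranks for three test queries (hypothetical)
dm_ranks = [1, 4, 2]
f_ranks = [2, 1, 4]

best_ranks = oracle_ranks(dm_ranks, f_ranks)
mrr_dm = mean_reciprocal_rank(dm_ranks)
mrr_f = mean_reciprocal_rank(f_ranks)
mrr_oracle = mean_reciprocal_rank(best_ranks)
print(mrr_dm, mrr_f, mrr_oracle)
```

Since the oracle never does worse than either model on any query, its MRR upper bounds that of any joint model that only chooses between the two fixed constituents.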

Table 9 breaks down the performance of models on the subsets of test queries whose gold entity pairs are OOV and non-OOV. This analysis is meaningful only for FB15K, since the other datasets have extreme entity-pair OOV rates (see Table 4). We observe that while F has extremely poor performance on OOVs (and thus weak performance overall), it performs decently on non-OOVs. RAL DM+F performs well on both OOVs and non-OOVs, whereas DM+F (AS) has poorer performance on both (although still better than vanilla F for OOVs). Also note that F is outperformed by DM even on non-OOVs; this refutes prior claims that F always performs better than TF models when test entity pairs are seen during training [Riedel et al., 2013, Toutanova et al., 2015].
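Splitting the test set by entity-pair OOV status, and computing the corresponding OOV rate, can be sketched as follows. The function names and toy triples are hypothetical.

```python
def split_by_pair_oov(train_triples, test_triples):
    """Split test triples by whether their (e1, e2) pair occurs in training."""
    seen_pairs = {(e1, e2) for e1, _r, e2 in train_triples}
    oov, non_oov = [], []
    for e1, r, e2 in test_triples:
        if (e1, e2) in seen_pairs:
            non_oov.append((e1, r, e2))
        else:
            oov.append((e1, r, e2))
    return oov, non_oov

def pair_oov_rate(train_triples, test_triples):
    """Percentage of test triples whose entity pair is unseen in training."""
    oov, _non_oov = split_by_pair_oov(train_triples, test_triples)
    return 100.0 * len(oov) / len(test_triples)

# Toy data (hypothetical)
train = [("a", "r1", "b"), ("c", "r1", "d")]
test = [
    ("a", "r2", "b"),  # pair seen in training
    ("c", "r2", "e"),  # pair unseen
    ("f", "r1", "g"),  # pair unseen
    ("c", "r3", "d"),  # pair seen in training
]
oov_queries, non_oov_queries = split_by_pair_oov(train, test)
rate = pair_oov_rate(train, test)
print(rate)
```

Evaluating each model separately on the two resulting subsets gives a breakdown of the kind discussed above.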

6 Discussion and Future Work

We now list two observations that suggest important directions for future research in KB inference.

Dataset Characteristics:

Our work subjects datasets to natural sanity checks. First, we introduce two most-frequent baselines (Table 2) to understand the nature of the KBs. Second, we compute entity-pair OOV rates (Table 4) as a rough predictor of the relative success of the TF and MF families. Finally, in Table 9, we report the singleton and doubleton percentages (for entity pairs). A singleton is an entity pair occurring only once in the data ( $\mathcal{T}_{tr}\cup\mathcal{T}_{ts}$ ) and a doubleton is an entity pair that occurs exactly twice. Doubletons have a strong effect in the scenario painted in Table 3. We find that almost every dataset has some idiosyncrasy, which raises the question of whether it is a good representative of naturally occurring datasets.
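The singleton and doubleton statistics can be computed along the following lines. The choice of denominator (the number of distinct entity pairs) is an assumption, since the text above does not specify it; names and toy data are hypothetical.

```python
from collections import Counter

def singleton_doubleton_percentages(train_triples, test_triples):
    """Percentages of distinct entity pairs occurring once and exactly twice.

    Occurrences are counted over the union of training and test tuples.
    """
    all_triples = list(train_triples) + list(test_triples)
    pair_counts = Counter((e1, e2) for e1, _r, e2 in all_triples)
    total = len(pair_counts)
    singletons = sum(1 for c in pair_counts.values() if c == 1)
    doubletons = sum(1 for c in pair_counts.values() if c == 2)
    return 100.0 * singletons / total, 100.0 * doubletons / total

# Toy data (hypothetical): two pairs occur twice, two occur once.
train = [("a", "r1", "b"), ("a", "r2", "b"), ("c", "r1", "d")]
test = [("e", "r1", "f"), ("c", "r2", "d"), ("g", "r1", "h")]

single_pct, double_pct = singleton_doubleton_percentages(train, test)
print(single_pct, double_pct)
```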

In particular, WN18 and FB15K-237 have near 100% entity-pair OOV rates, which is unlikely to be the case in real KBs. In FB15K-237 the best models are not much better than the MFreq( $e_{2}|r^{*}$ ) baseline. This is because the dataset is artificially constructed to avoid relations with entity-pair overlap, but this reduces its ability to support many interesting inferences. For NYT+FB, MFreq( $e_{2}|e_{1}^{*}$ ) performs strongly, with a 95% score on HITS@10. Moreover, learned models are able to improve on its MRR by only about three percentage points. Statistics in Table 9 reveal that this could be because the dataset has an unusually high number of entity-pair doubletons: it is the only dataset where doubletons by far outnumber singletons. It is unlikely that such a distribution occurs in a naturally occurring dataset. FB15K appears to pass our sanity tests. We believe that a focus on better datasets will likely lead to better progress on KB inference.

Path based inference:

In KBs, a common type of inference is based on relation paths (or Horn clauses), e.g., (Michael Jordan, teaches at, Berkeley) and (Berkeley, is located in, California) imply (Michael Jordan, teaches in, California). To assess the ability of inference models to automatically learn such relation paths, we tested them on artificial datasets, where we provided many instances of two-hop paths with relations $r_{1}$ and $r_{2}$ implying a third relation $r_{3}$ . We find that none of the four models are effective at predicting such relations. A study similar to ours comparing the latest models that train over relation paths [Guu et al., 2015, García-Durán et al., 2015, Toutanova et al., 2016] would benefit our understanding of path-based inference.
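For reference, the two-hop rule itself is easy to apply symbolically. The sketch below is not one of the evaluated embedding models and is not the authors' dataset generator; it only illustrates the kind of inference the models were expected to learn, with a hypothetical function name.

```python
from collections import defaultdict

def infer_by_path(triples, r1, r2, r3):
    """Apply the rule r1(a, b) and r2(b, c) => r3(a, c) to a set of triples."""
    r2_tails = defaultdict(set)
    for head, rel, tail in triples:
        if rel == r2:
            r2_tails[head].add(tail)

    inferred = set()
    for head, rel, middle in triples:
        if rel == r1:
            for tail in r2_tails.get(middle, ()):
                inferred.add((head, r3, tail))
    return inferred

kb = [
    ("Michael Jordan", "teaches at", "Berkeley"),
    ("Berkeley", "is located in", "California"),
]
new_facts = infer_by_path(kb, "teaches at", "is located in", "teaches in")
print(new_facts)
```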

7 Related Work

Traditional methods for inference over KBs include random walks over knowledge graphs [Lao et al., 2011], natural logic inference [MacCartney and Manning, 2007], and use of statistical relational learning models such as Markov Logic Network, Bayesian Logic Programs, and Probabilistic Soft Logic [Schoenmackers et al., 2008, Raghavan et al., 2012, Wang and Cohen, 2015]. These need (or benefit from) a background knowledge of inference rules, predominantly generated via extended distributional similarity [Lin and Pantel, 2001, Schoenmackers et al., 2010, Nakashole et al., 2012, Galárraga et al., 2013, Grycner et al., 2015, Berant et al., 2012, Jain and Mausam, 2016].

Neural methods for KB inference combine both inference and rule learning into one unified framework to add new facts to the KB directly. Both MF and TF methods have been very popular, with several extensions proposed for each. The original F model has been extended to incorporate first-order logic rules [Rocktäschel et al., 2015, Demeester et al., 2016], to predict for relations not seen at training time [Verga et al., 2016a], etc. It has also been extended to generate the embedding of a new entity pair on the fly [Verga et al., 2016b]. But that is different from our OOV method, since, at test time, they expect knowledge of several tuples between the same entity pair.

Similarly, other TF models also exist, for example, Parafac [Harshman, 1970], Rescal [Nickel et al., 2011] and NTN [Socher et al., 2013]. These are older models that have been shown to be outperformed by the models evaluated in this paper. More recent models have also been introduced, such as a model using holographic embeddings [Nickel et al., 2016], and another with asymmetric embeddings using complex vectors [Trouillon et al., 2016]. It would be valuable to compare these rigorously as well. The learned embeddings can use additional information such as typing [Chang et al., 2014], have been used to mine logical rules [Yang et al., 2015] and have been used for schema induction [Nimishakavi et al., 2016].

8 Conclusion

We extensively evaluate various tensor factorization (TF) and matrix factorization (MF) models for KB inference on all popular datasets. After replacing the standard evaluation protocol with our proposed OOV-cognizant KBI protocol, we find that DistMult (a TF model) is fairly robust across a variety of datasets, but F (an MF model) outperforms others on one dataset. F’s performance increases further by training an OOV entity-pair vector. Finally, we propose joint models that combine DistMult and F. We find that adding the loss functions from both models with a regularization on F’s parameters achieves the most robust results across all datasets.

We also present a series of analyses of our empirical results. First, our work increases our understanding of the relative strengths and weaknesses of MF and TF models given some important bulk characteristics of the datasets. Specifically, we establish a strong connection between the accuracy of various approaches and the fraction of OOV test entity pairs, and the proportion between entity-pair singletons and doubletons. Second, we find that our joint model achieves results on par with the best individual models for both OOV and non-OOV queries. As a by-product, we identify some peculiarities in existing datasets, which suggests a need to design better benchmark datasets.

We release our code for all models and evaluation protocols for further use by the research community. In the future, we wish to study models that explicitly incorporate relation paths for KB inference.

Bibliography

[Berant et al., 2011] Jonathan Berant, Ido Dagan, and Jacob Goldberger. 2011. Global learning of typed entailment rules. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pages 610–619. Association for Computational Linguistics.
[Berant et al., 2012] Jonathan Berant, Ido Dagan, Meni Adler, and Jacob Goldberger. 2012. Efficient tree-based approximation for entailment graph learning. In The 50th Annual Meeting of the Association for Computational Linguistics, Proceedings of the System Demonstrations, July 10, 2012, Jeju Island, Korea.
[Bordes et al., 2013] Antoine Bordes, Nicolas Usunier, Alberto García-Durán, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Burges et al. [Burges et al., 2013], pages 2787–2795.
[Burges et al., 2013] Christopher J. C. Burges, Léon Bottou, Zoubin Ghahramani, and Kilian Q. Weinberger, editors. 2013. Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States.
[Chang et al., 2014] Kai-Wei Chang, Wen-tau Yih, Bishan Yang, and Christopher Meek. 2014. Typed tensor decomposition of knowledge bases for relation extraction. In Alessandro Moschitti, Bo Pang, and Walter Daelemans, editors, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1568–1579. ACL.
[Chollet, 2015] François Chollet. 2015. Keras. https://github.com/fchollet/keras.
[Demeester et al., 2016] Thomas Demeester, Tim Rocktäschel, and Sebastian Riedel. 2016. Regularizing relation representations by first-order implications. In Jay Pujara, Tim Rocktäschel, Danqi Chen, and Sameer Singh, editors, Proceedings of the 5th Workshop on Automated Knowledge Base Construction, AKBC@NAACL-HLT 2016, San Diego, CA, USA, June 17, 2016, pages 75–80. The Association for Computer Linguistics.
[Etzioni et al., 2011] Oren Etzioni, Anthony Fader, Janara Christensen, Stephen Soderland, and Mausam. 2011. Open information extraction: The second generation. In IJCAI, volume 11, pages 3–10.