TL;DR
This paper introduces a novel, model-agnostic diachronic embedding approach for temporal knowledge graph completion, enhancing static models with temporal entity characteristics to improve inference accuracy.
Contribution
It proposes a new diachronic embedding function that can be integrated with any static KG model, demonstrated with SimplE, to effectively handle temporal information.
Findings
The proposed model outperforms existing baselines in experiments.
The embedding function is fully expressive when combined with SimplE.
The approach is model-agnostic and adaptable to various static KG models.
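Table 1: Statistics of the ICEWS14, ICEWS05-15, and GDELT datasets.

| Dataset | #Entities | #Relations | #Timestamps | #Train | #Validation | #Test | #Facts (total) |
|---|---|---|---|---|---|---|---|
| ICEWS14 | 7,128 | 230 | 365 | 72,826 | 8,941 | 8,963 | 90,730 |
| ICEWS05-15 | 10,488 | 251 | 4,017 | 386,962 | 46,275 | 46,092 | 479,329 |
| GDELT | 500 | 20 | 366 | 2,735,685 | 341,961 | 341,961 | 3,419,607 |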
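
Table 2: Results on ICEWS14, ICEWS05-15, and GDELT.

| Model | ICEWS14 MRR | ICEWS14 Hit@1 | ICEWS14 Hit@3 | ICEWS14 Hit@10 | ICEWS05-15 MRR | ICEWS05-15 Hit@1 | ICEWS05-15 Hit@3 | ICEWS05-15 Hit@10 | GDELT MRR | GDELT Hit@1 | GDELT Hit@3 | GDELT Hit@10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|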
| TransE | 0.280 | 9.4 | - | 63.7 | 0.294 | 9.0 | - | 66.3 | 0.113 | 0.0 | 15.8 | 31.2 |
| DistMult | 0.439 | 32.3 | - | 67.2 | 0.456 | 33.7 | - | 69.1 | 0.196 | 11.7 | 20.8 | 34.8 |
| SimplE | 0.458 | 34.1 | 51.6 | 68.7 | 0.478 | 35.9 | 53.9 | 70.8 | 0.206 | 12.4 | 22.0 | 36.6 |
| ConT | 0.185 | 11.7 | 20.5 | 31.5 | 0.163 | 10.5 | 18.9 | 27.2 | 0.144 | 8.0 | 15.6 | 26.5 |
| TTransE | 0.255 | 7.4 | - | 60.1 | 0.271 | 8.4 | - | 61.6 | 0.115 | 0.0 | 16.0 | 31.8 |
| HyTE | 0.297 | 10.8 | 41.6 | 65.5 | 0.316 | 11.6 | 44.5 | 68.1 | 0.118 | 0.0 | 16.5 | 32.6 |
| TA-DistMult | 0.477 | 36.3 | - | 68.6 | 0.474 | 34.6 | - | 72.8 | 0.206 | 12.4 | 21.9 | 36.5 |
| DE-TransE | 0.326 | 12.4 | 46.7 | 68.6 | 0.314 | 10.8 | 45.3 | 68.5 | 0.126 | 0.0 | 18.1 | 35.0 |
| DE-DistMult | 0.501 | 39.2 | 56.9 | 70.8 | 0.484 | 36.6 | 54.6 | 71.8 | 0.213 | 13.0 | 22.8 | 37.6 |
| DE-SimplE | 0.526 | 41.8 | 59.2 | 72.5 | 0.513 | 39.2 | 57.8 | 74.8 | 0.230 | 14.1 | 24.8 | 40.3 |

Table 3: Results for variations of the proposed models on ICEWS14.

| Model | Variation | MRR | Hit@1 | Hit@3 | Hit@10 |
|---|---|---|---|---|---|
| DE-TransE | No variation (Activation function: Sine) | 0.326 | 12.4 | 46.7 | 68.6 |
| DE-DistMult | No variation (Activation function: Sine) | 0.501 | 39.2 | 56.9 | 70.8 |
| DE-DistMult | Activation function: Tanh | 0.486 | 37.5 | 54.7 | 70.1 |
| DE-DistMult | Activation function: Sigmoid | 0.484 | 37.0 | 54.6 | 70.6 |
| DE-DistMult | Activation function: Leaky ReLU | 0.478 | 36.3 | 54.2 | 70.1 |
| DE-DistMult | Activation function: Squared Exponential | 0.501 | 39.0 | 56.8 | 70.9 |
| DE-TransE | Diachronic embedding for both entities and relations | 0.324 | 12.7 | 46.1 | 68.0 |
| DE-DistMult | Diachronic embedding for both entities and relations | 0.502 | 39.4 | 56.6 | 70.4 |
| DistMult | Generalizing to unseen timestamps | 0.410 | 30.2 | 46.2 | 62.0 |
| DE-DistMult | Generalizing to unseen timestamps | 0.452 | 34.5 | 51.3 | 65.4 |
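| DE-DistMult | $\boldsymbol{a}_{\mathsf{v}}[n]=1$ for $1\le n\le\gamma d$ for all $\mathsf{v}\in\mathcal{V}$ | 0.458 | 34.4 | 51.8 | 68.3 |
| DE-DistMult | $\boldsymbol{w}_{\mathsf{v}}[n]=1$ for $1\le n\le\gamma d$ for all $\mathsf{v}\in\mathcal{V}$ | 0.470 | 36.4 | 53.1 | 67.1 |
| DE-DistMult | $\boldsymbol{b}_{\mathsf{v}}[n]=0$ for $1\le n\le\gamma d$ for all $\mathsf{v}\in\mathcal{V}$ | 0.498 | 38.9 | 56.2 | 70.4 |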
Diachronic Embedding for Temporal Knowledge Graph Completion
Rishab Goel∗, Seyed Mehran Kazemi∗, Marcus Brubaker, Pascal Poupart
Borealis AI
{rishab.goel,mehran.kazemi,marcus.brubaker,pascal.poupart}@borealisai.com
∗Equal contribution.
Abstract
Knowledge graphs (KGs) typically contain temporal facts indicating relationships among entities at different times. Due to their incompleteness, several approaches have been proposed to infer new facts for a KG based on the existing ones, a problem known as KG completion. KG embedding approaches have proved effective for KG completion; however, they have been developed mostly for static KGs. Developing temporal KG embedding models is an increasingly important problem. In this paper, we build novel models for temporal KG completion by equipping static models with a diachronic entity embedding function which provides the characteristics of entities at any point in time. This is in contrast to the existing temporal KG embedding approaches where only static entity features are provided. The proposed embedding function is model-agnostic and can potentially be combined with any static model. We prove that combining it with SimplE, a recent model for static KG embedding, results in a fully expressive model for temporal KG completion. Our experiments indicate the superiority of our proposal compared to existing baselines.
1 Introduction
Knowledge graphs (KGs) are directed graphs where nodes represent entities and (labeled) edges represent the types of relationships among entities. Each edge in a KG corresponds to a fact and can be represented as a tuple $(\mathsf{v},\mathsf{r},\mathsf{u})$, where $\mathsf{v}$ and $\mathsf{u}$ are called the head and tail entities respectively and $\mathsf{r}$ is a relation. An important problem, known as KG completion, is to infer new facts from a KG based on the existing ones. This problem has been extensively studied for static KGs (see [46, 62, 44] for a summary). KG embedding approaches have offered state-of-the-art results for KG completion on several benchmarks. These approaches map each entity and each relation type to a hidden representation and compute a score for each tuple by applying a score function to these representations. Different approaches differ in how they map the entities and relation types to hidden representations and in their score functions.
To capture the temporal aspect of the facts, KG edges are typically associated with a timestamp or time interval, giving tuples of the form $(\mathsf{v},\mathsf{r},\mathsf{u},\mathsf{t})$. However, KG embedding approaches have mostly been designed for static KGs, ignoring the temporal aspect. Recent work has shown a substantial boost in performance by extending these approaches to utilize time [21, 10, 41, 16]. The proposed extensions mainly compute a hidden representation for each timestamp and extend the score functions to utilize timestamp representations as well as entity and relation representations.
In this paper, we develop models for temporal KG completion (TKGC) based on an intuitive assumption: to provide a score for a temporal tuple $(\mathsf{v},\mathsf{r},\mathsf{u},\mathsf{t})$, one needs to know the features of $\mathsf{v}$ and $\mathsf{u}$ at time $\mathsf{t}$; providing a score based on their current features may be misleading. That is because the characteristics of $\mathsf{v}$ and the sentiment towards $\mathsf{u}$ may have been quite different at time $\mathsf{t}$ compared to now. Consequently, learning a static representation for each entity, as is done by existing approaches, may be sub-optimal, as such a representation only captures the entity features at the current time, or an aggregation of entity features over time.
To provide entity features at any given time, we define entity embedding as a function which takes an entity and a timestamp as input and provides a hidden representation for the entity at that time. Inspired by diachronic word embeddings, we call our proposed embedding diachronic embedding (DE). DE is model-agnostic: any static KG embedding model can potentially be extended to TKGC by leveraging DE. We prove that combining DE with SimplE [25] results in a fully expressive model for TKGC. To the best of our knowledge, this is the first TKGC model with a proof of full expressiveness. We show the merit of our model on subsets of the ICEWS [5] and GDELT [38] datasets.
2 Background and Notation
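Notation: Lower-case letters denote scalars, bold lower-case letters denote vectors, and bold upper-case letters denote matrices. $\boldsymbol{z}[n]$ represents the $n^{th}$ element of a vector $\boldsymbol{z}$, $||\boldsymbol{z}||$ represents its norm, and $\boldsymbol{z}^{T}$ represents its transpose. For two vectors $\boldsymbol{z}_{1}\in\mathbb{R}^{d_{1}}$ and $\boldsymbol{z}_{2}\in\mathbb{R}^{d_{2}}$, $[\boldsymbol{z}_{1};\boldsymbol{z}_{2}]\in\mathbb{R}^{d_{1}+d_{2}}$ represents the concatenation of the two vectors. $\boldsymbol{z}_{1}\otimes\boldsymbol{z}_{2}$ represents a vector $\boldsymbol{z}\in\mathbb{R}^{d_{1}d_{2}}$ such that $\boldsymbol{z}[(n-1)d_{2}+m]=\boldsymbol{z}_{1}[n]\,\boldsymbol{z}_{2}[m]$ (i.e. the flattened vector of the tensor/outer product of the two vectors). For $k$ vectors $\boldsymbol{z}_{1},\dots,\boldsymbol{z}_{k}$ of the same length $d$, $\langle\boldsymbol{z}_{1},\dots,\boldsymbol{z}_{k}\rangle=\sum_{n=1}^{d}\boldsymbol{z}_{1}[n]\cdots\boldsymbol{z}_{k}[n]$ represents the sum of the element-wise product of the elements of the vectors.
Temporal Knowledge Graph (Completion): Let $\mathcal{V}$ be a finite set of entities, $\mathcal{R}$ be a finite set of relation types, and $\mathcal{T}$ be a finite set of timestamps. Let $\mathcal{W}\subset\mathcal{V}\times\mathcal{R}\times\mathcal{V}\times\mathcal{T}$ represent the set of all temporal tuples $(\mathsf{v},\mathsf{r},\mathsf{u},\mathsf{t})$ that are facts (i.e. true in a world), where $\mathsf{v},\mathsf{u}\in\mathcal{V}$, $\mathsf{r}\in\mathcal{R}$, and $\mathsf{t}\in\mathcal{T}$. Let $\mathcal{W}^{c}$ be the complement of $\mathcal{W}$. A temporal knowledge graph (KG) $\mathcal{G}$ is a subset of $\mathcal{W}$ (i.e. $\mathcal{G}\subset\mathcal{W}$). Temporal KG completion (TKGC) is the problem of inferring $\mathcal{W}$ from $\mathcal{G}$.
Relation Properties: A relation $\mathsf{r}$ is symmetric if $(\mathsf{v},\mathsf{r},\mathsf{u},\mathsf{t})\in\mathcal{W}\iff(\mathsf{u},\mathsf{r},\mathsf{v},\mathsf{t})\in\mathcal{W}$ and anti-symmetric if $(\mathsf{v},\mathsf{r},\mathsf{u},\mathsf{t})\in\mathcal{W}\iff(\mathsf{u},\mathsf{r},\mathsf{v},\mathsf{t})\in\mathcal{W}^{c}$. A relation $\mathsf{r}_{i}$ is the inverse of another relation $\mathsf{r}_{j}$ if $(\mathsf{v},\mathsf{r}_{i},\mathsf{u},\mathsf{t})\in\mathcal{W}\iff(\mathsf{u},\mathsf{r}_{j},\mathsf{v},\mathsf{t})\in\mathcal{W}$. $\mathsf{r}_{i}$ entails $\mathsf{r}_{j}$ if $(\mathsf{v},\mathsf{r}_{i},\mathsf{u},\mathsf{t})\in\mathcal{W}\Rightarrow(\mathsf{v},\mathsf{r}_{j},\mathsf{u},\mathsf{t})\in\mathcal{W}$.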
KG Embedding: Formally, we define an entity embedding as follows.
**Definition 1.**
An entity embedding, $\mathtt{EEMB}:\mathcal{V}\rightarrow\psi$, is a function which maps every entity $\mathsf{v}\in\mathcal{V}$ to a hidden representation in $\psi$, where $\psi$ is the class of non-empty tuples of vectors and/or matrices.
A relation embedding ($\mathtt{REMB}:\mathcal{R}\rightarrow\psi$) is defined similarly. We refer to the hidden representation of an entity (relation) as the embedding of the entity (relation). A KG embedding model defines two things: 1- the $\mathtt{EEMB}$ and $\mathtt{REMB}$ functions, 2- a score function which takes $\mathtt{EEMB}$ and $\mathtt{REMB}$ as input and provides a score for a given tuple. The parameters of hidden representations are learned from data.
3 Existing Approaches
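In this section, we describe the existing approaches for static and temporal KG completion that will be used in the rest of the paper. For further detail on temporal KG completion approaches, we refer the reader to a recent survey [27]. We represent the score for a tuple by $\phi(\cdot)$.
TransE (static) [4]: In TransE, $\mathtt{EEMB}(\mathsf{v})=(\boldsymbol{z}_{\mathsf{v}})$ for every $\mathsf{v}\in\mathcal{V}$ where $\boldsymbol{z}_{\mathsf{v}}\in\mathbb{R}^{d}$, $\mathtt{REMB}(\mathsf{r})=(\boldsymbol{z}_{\mathsf{r}})$ for every $\mathsf{r}\in\mathcal{R}$ where $\boldsymbol{z}_{\mathsf{r}}\in\mathbb{R}^{d}$, and $\phi(\mathsf{v},\mathsf{r},\mathsf{u})=-||\boldsymbol{z}_{\mathsf{v}}+\boldsymbol{z}_{\mathsf{r}}-\boldsymbol{z}_{\mathsf{u}}||$.
DistMult (static) [64]: Same $\mathtt{EEMB}$ and $\mathtt{REMB}$ as TransE but defining $\phi(\mathsf{v},\mathsf{r},\mathsf{u})=\langle\boldsymbol{z}_{\mathsf{v}},\boldsymbol{z}_{\mathsf{r}},\boldsymbol{z}_{\mathsf{u}}\rangle$.
Tucker (static) [60, 2]: Same $\mathtt{EEMB}$ and $\mathtt{REMB}$ as TransE but defining $\phi(\mathsf{v},\mathsf{r},\mathsf{u})=\langle\boldsymbol{w},\boldsymbol{z}_{\mathsf{v}}\otimes\boldsymbol{z}_{\mathsf{r}}\otimes\boldsymbol{z}_{\mathsf{u}}\rangle$ where $\boldsymbol{w}\in\mathbb{R}^{d^{3}}$ is a weight vector shared for all tuples.
RESCAL (static) [45]: Same $\mathtt{EEMB}$ as TransE but defining $\mathtt{REMB}(\mathsf{r})=(\boldsymbol{Z}_{\mathsf{r}})$ for every $\mathsf{r}\in\mathcal{R}$ where $\boldsymbol{Z}_{\mathsf{r}}\in\mathbb{R}^{d\times d}$, and defining $\phi(\mathsf{v},\mathsf{r},\mathsf{u})=\boldsymbol{z}_{\mathsf{v}}^{T}\boldsymbol{Z}_{\mathsf{r}}\boldsymbol{z}_{\mathsf{u}}$.
Canonical Polyadic (CP) (static) [19]: Same $\mathtt{REMB}$ as TransE but defining $\mathtt{EEMB}(\mathsf{v})=(\overrightarrow{\boldsymbol{z}}_{\mathsf{v}},\overleftarrow{\boldsymbol{z}}_{\mathsf{v}})$ for every $\mathsf{v}\in\mathcal{V}$ where $\overrightarrow{\boldsymbol{z}}_{\mathsf{v}},\overleftarrow{\boldsymbol{z}}_{\mathsf{v}}\in\mathbb{R}^{d}$. $\overrightarrow{\boldsymbol{z}}_{\mathsf{v}}$ is used when $\mathsf{v}$ is the head and $\overleftarrow{\boldsymbol{z}}_{\mathsf{v}}$ is used when $\mathsf{v}$ is the tail. In CP, $\phi(\mathsf{v},\mathsf{r},\mathsf{u})=\langle\overrightarrow{\boldsymbol{z}}_{\mathsf{v}},\boldsymbol{z}_{\mathsf{r}},\overleftarrow{\boldsymbol{z}}_{\mathsf{u}}\rangle$. DistMult is a special case of CP where $\overrightarrow{\boldsymbol{z}}_{\mathsf{v}}=\overleftarrow{\boldsymbol{z}}_{\mathsf{v}}$ for every $\mathsf{v}\in\mathcal{V}$.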
SimplE (static) [25]: Noticing an information flow issue between the two vectors $\overrightarrow{\boldsymbol{z}}_{\mathsf{v}}$ and $\overleftarrow{\boldsymbol{z}}_{\mathsf{v}}$ of an entity $\mathsf{v}$ in CP, Kazemi and Poole [25] take advantage of the inverse of the relations to address this issue. They define $\mathtt{REMB}(\mathsf{r})=(\overrightarrow{\boldsymbol{z}}_{\mathsf{r}},\overleftarrow{\boldsymbol{z}}_{\mathsf{r}})$ for every $\mathsf{r}\in\mathcal{R}$, where $\overrightarrow{\boldsymbol{z}}_{\mathsf{r}}\in\mathbb{R}^{d}$ is used as in CP and $\overleftarrow{\boldsymbol{z}}_{\mathsf{r}}\in\mathbb{R}^{d}$ is considered the embedding of $\mathsf{r}^{-1}$, the inverse of $\mathsf{r}$. In SimplE, $\phi(\mathsf{v},\mathsf{r},\mathsf{u})$ is defined as the average of two CP scores: 1- $\langle\overrightarrow{\boldsymbol{z}}_{\mathsf{v}},\overrightarrow{\boldsymbol{z}}_{\mathsf{r}},\overleftarrow{\boldsymbol{z}}_{\mathsf{u}}\rangle$ corresponding to the score for $(\mathsf{v},\mathsf{r},\mathsf{u})$ and 2- $\langle\overrightarrow{\boldsymbol{z}}_{\mathsf{u}},\overleftarrow{\boldsymbol{z}}_{\mathsf{r}},\overleftarrow{\boldsymbol{z}}_{\mathsf{v}}\rangle$ corresponding to the score for $(\mathsf{u},\mathsf{r}^{-1},\mathsf{v})$. A similar extension of CP has been proposed in [34].
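The static score functions above can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the authors' implementation: the function names are hypothetical, embeddings are plain lists of floats, and the L2 norm is assumed for TransE.

```python
import math

def inner(*vectors):
    """Sum over dimensions of the element-wise product of equal-length vectors."""
    return sum(math.prod(column) for column in zip(*vectors))

def transe_score(z_v, z_r, z_u):
    """TransE: negative distance between head + relation and tail (L2 norm assumed)."""
    return -math.sqrt(sum((v + r - u) ** 2 for v, r, u in zip(z_v, z_r, z_u)))

def distmult_score(z_v, z_r, z_u):
    """DistMult: three-way inner product of head, relation, and tail vectors."""
    return inner(z_v, z_r, z_u)

def simple_score(fwd_v, bwd_v, fwd_u, bwd_u, fwd_r, bwd_r):
    """SimplE: average of the CP score for (v, r, u) and for (u, r^-1, v)."""
    forward = inner(fwd_v, fwd_r, bwd_u)   # head vector of v, tail vector of u
    inverse = inner(fwd_u, bwd_r, bwd_v)   # head vector of u, tail vector of v
    return 0.5 * (forward + inverse)
```

Note that `distmult_score` is unchanged when head and tail are swapped, which is why DistMult can only model symmetric relations, whereas `simple_score` uses separate head and tail vectors.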
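TTransE (temporal) [21]: An extension of TransE by adding one more embedding function mapping timestamps to hidden representations: $\mathtt{TEMB}(\mathsf{t})=(\boldsymbol{z}_{\mathsf{t}})$ for every $\mathsf{t}\in\mathcal{T}$ where $\boldsymbol{z}_{\mathsf{t}}\in\mathbb{R}^{d}$. In TTransE, $\phi(\mathsf{v},\mathsf{r},\mathsf{u},\mathsf{t})=-||\boldsymbol{z}_{\mathsf{v}}+\boldsymbol{z}_{\mathsf{r}}+\boldsymbol{z}_{\mathsf{t}}-\boldsymbol{z}_{\mathsf{u}}||$.
HyTE (temporal) [10]: Same $\mathtt{EEMB}$, $\mathtt{REMB}$ and $\mathtt{TEMB}$ as TTransE but defining $\phi(\mathsf{v},\mathsf{r},\mathsf{u},\mathsf{t})=-||\overline{\boldsymbol{z}}_{\mathsf{v}}+\overline{\boldsymbol{z}}_{\mathsf{r}}-\overline{\boldsymbol{z}}_{\mathsf{u}}||$ where $\overline{\boldsymbol{z}}_{\mathsf{x}}=\boldsymbol{z}_{\mathsf{x}}-(\boldsymbol{z}_{\mathsf{t}}^{T}\boldsymbol{z}_{\mathsf{x}})\boldsymbol{z}_{\mathsf{t}}$ for $\mathsf{x}\in\{\mathsf{v},\mathsf{r},\mathsf{u}\}$. Intuitively, HyTE first projects the head, relation, and tail embeddings to the space of the timestamp and then applies the TransE function on the projected embeddings.
ConT (temporal) [41]: Ma et al. [41] extend several static KG embedding models to TKGC. Their best performing model, ConT, is an extension of Tucker defining $\mathtt{TEMB}(\mathsf{t})=(\boldsymbol{z}_{\mathsf{t}})$ for every $\mathsf{t}\in\mathcal{T}$ where $\boldsymbol{z}_{\mathsf{t}}\in\mathbb{R}^{d^{3}}$ and changing the score function to $\phi(\mathsf{v},\mathsf{r},\mathsf{u},\mathsf{t})=\langle\boldsymbol{z}_{\mathsf{t}},\boldsymbol{z}_{\mathsf{v}}\otimes\boldsymbol{z}_{\mathsf{r}}\otimes\boldsymbol{z}_{\mathsf{u}}\rangle$. Intuitively, ConT replaces the shared vector $\boldsymbol{w}$ in Tucker with timestamp embeddings $\boldsymbol{z}_{\mathsf{t}}$.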
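TA-DistMult (temporal) [16]: An extension of DistMult where each character $c$ in the timestamps is mapped to a vector ($\mathtt{CEMB}(c)=\boldsymbol{z}_{c}$) where $\boldsymbol{z}_{c}\in\mathbb{R}^{d}$. Then, for a tuple $(\mathsf{v},\mathsf{r},\mathsf{u},\mathsf{t})$, a temporal relation is created by considering $\mathsf{r}$ and the characters in $\mathsf{t}$ as a sequence, and an embedding $\boldsymbol{z}_{\mathsf{r},\mathsf{t}}$ is computed for this temporal relation by feeding the embedding vectors for each element in the sequence to an LSTM and taking its final output. Finally, the score function of DistMult is employed: $\phi(\mathsf{v},\mathsf{r},\mathsf{u},\mathsf{t})=\langle\boldsymbol{z}_{\mathsf{v}},\boldsymbol{z}_{\mathsf{r},\mathsf{t}},\boldsymbol{z}_{\mathsf{u}}\rangle$ (TransE was employed as well but DistMult performed better).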
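As a rough illustration of how the two TransE-based temporal baselines use a timestamp vector, the sketch below scores a tuple with TTransE (the timestamp embedding is added as a translation) and with HyTE (embeddings are first projected onto a timestamp hyperplane). The function names are hypothetical, the L2 norm is assumed, and the HyTE projection assumes the timestamp vector has unit norm.

```python
import math

def norm(z):
    """Euclidean (L2) norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in z))

def ttranse_score(z_v, z_r, z_u, z_t):
    """TTransE: TransE with the timestamp embedding added as an extra translation."""
    return -norm([v + r + t - u for v, r, u, t in zip(z_v, z_r, z_u, z_t)])

def project(z_x, z_t):
    """Project z_x onto the hyperplane with normal z_t (z_t assumed unit-norm)."""
    dot = sum(t * x for t, x in zip(z_t, z_x))
    return [x - dot * t for x, t in zip(z_x, z_t)]

def hyte_score(z_v, z_r, z_u, z_t):
    """HyTE: apply the TransE score to the projected head, relation, and tail."""
    p_v, p_r, p_u = (project(z, z_t) for z in (z_v, z_r, z_u))
    return -norm([v + r - u for v, r, u in zip(p_v, p_r, p_u)])
```

In both cases one vector per timestamp has to be learned, which is the reason these models have no representation for a timestamp that never appears in training.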
4 Diachronic Embedding
According to Definition 1, an entity embedding function takes an entity as input and provides a hidden representation as output. We propose an alternative entity embedding function which, besides entity, takes time as input as well. Inspired by diachronic word embeddings, we call such an embedding function a diachronic entity embedding. Below is a formal definition of a diachronic entity embedding.
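**Definition 2.**
A diachronic entity embedding, $\mathtt{DEEMB}:(\mathcal{V},\mathcal{T})\rightarrow\psi$, is a function which maps every pair $(\mathsf{v},\mathsf{t})$, where $\mathsf{v}\in\mathcal{V}$ and $\mathsf{t}\in\mathcal{T}$, to a hidden representation in $\psi$, where $\psi$ is the class of non-empty tuples of vectors and/or matrices.
One may take any static KG embedding score function and make it temporal by replacing its entity embeddings with diachronic entity embeddings. The choice of the $\mathtt{DEEMB}$ function can be different for various temporal KGs depending on their properties. Here, we propose a $\mathtt{DEEMB}$ function which performs well on our benchmarks. We give the definition for models where the output of the $\mathtt{DEEMB}$ function is a tuple of vectors, but it can be generalized to other cases as well. Let $\boldsymbol{z}^{\mathsf{t}}_{\mathsf{v}}\in\mathbb{R}^{d}$ be a vector in $\mathtt{DEEMB}(\mathsf{v},\mathsf{t})$ (i.e. $\mathtt{DEEMB}(\mathsf{v},\mathsf{t})=(\dots,\boldsymbol{z}^{\mathsf{t}}_{\mathsf{v}},\dots)$). We define $\boldsymbol{z}^{\mathsf{t}}_{\mathsf{v}}$ as follows: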
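$$\boldsymbol{z}^{\mathsf{t}}_{\mathsf{v}}[n]=\begin{cases}\boldsymbol{a}_{\mathsf{v}}[n]\,\sigma(\boldsymbol{w}_{\mathsf{v}}[n]\,\mathsf{t}+\boldsymbol{b}_{\mathsf{v}}[n]), & \text{if } 1\le n\le\gamma d,\\ \boldsymbol{a}_{\mathsf{v}}[n], & \text{if } \gamma d<n\le d.\end{cases}\tag{1}$$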
where $\boldsymbol{a}_{\mathsf{v}}\in\mathbb{R}^{d}$ and $\boldsymbol{w}_{\mathsf{v}},\boldsymbol{b}_{\mathsf{v}}\in\mathbb{R}^{\gamma d}$ are (entity-specific) vectors with learnable parameters and $\sigma$ is an activation function. Intuitively, entities may have some features that change over time and some features that remain fixed. The first $\gamma d$ elements of the vector in Equation (1) capture temporal features and the other $(1-\gamma)d$ elements capture static features. $0\le\gamma\le 1$ is a hyper-parameter controlling the percentage of temporal features. While in Equation (1) static features can potentially be obtained from the temporal ones if the optimizer sets some elements of $\boldsymbol{w}_{\mathsf{v}}$ to zero, explicitly modeling static features helps reduce the number of learnable parameters and avoid overfitting to temporal signals (see Section 5.2).
Intuitively, by learning the $\boldsymbol{w}_{\mathsf{v}}$s and $\boldsymbol{b}_{\mathsf{v}}$s, the model learns how to turn entity features on and off at different points in time so that accurate temporal predictions can be made about them at any time. The $\boldsymbol{a}_{\mathsf{v}}$s control the importance of the features. We mainly use sine as the activation function for Equation (1) because one sine function can model several on and off states. Our experiments explore other activation functions as well and provide more intuition.
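A minimal sketch of the diachronic embedding function may help make this concrete. It assumes Equation (1) multiplies each temporal feature's amplitude by an activation of a linear function of time and leaves the remaining features static; the function name and the argument names `a`, `w`, `b` are illustrative, and the number of temporal features is taken from the length of `w`.

```python
import math

def diachronic_embedding(a, w, b, t, activation=math.sin):
    """Return the embedding of one entity at time t.

    a: d amplitudes (the last d - len(w) entries act as static features)
    w, b: frequencies and biases for the temporal features only
    """
    k = len(w)  # number of temporal features, i.e. gamma * d
    if len(b) != k or k > len(a):
        raise ValueError("w and b must have equal length, at most len(a)")
    temporal = [a[n] * activation(w[n] * t + b[n]) for n in range(k)]
    static = list(a[k:])  # static features do not depend on t
    return temporal + static

# Usage: a 3-dimensional embedding with one temporal and two static features.
early = diachronic_embedding([2.0, 3.0, 5.0], [0.5], [0.0], t=0.0)
later = diachronic_embedding([2.0, 3.0, 5.0], [0.5], [0.0], t=math.pi)
```

Because the function accepts any real-valued `t`, it also produces an embedding for timestamps that were never seen during training.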
Model-Agnosticism: The proposals in existing temporal KG embedding models can only extend one (or a few) static models to temporal KGs. As an example, it is not trivial how RESCAL can be extended to temporal KGs using the proposal in [16] (except for the naive approach of expecting the LSTM to output large matrices) or in [21, 10]. The same goes for models other than RESCAL where the relation embeddings contain matrices (see, e.g., [43, 53, 39]). Using our proposal, one may construct temporal versions of TransE, DistMult, SimplE, Tucker, RESCAL, or other models by replacing their $\mathtt{EEMB}$ function with the $\mathtt{DEEMB}$ function in Equation (1). We refer to the resulting models as DE-TransE, DE-DistMult, DE-SimplE and so forth, where DE is short for Diachronic Embedding.
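The model-agnostic construction can be illustrated with a small wrapper that turns any static score function over vectors into a temporal one by swapping in time-dependent entity vectors. This is a sketch under simplifying assumptions: `make_temporal`, `deemb`, and the parameter layout are hypothetical names, each entity has a single vector (as in TransE and DistMult), sine is the activation, and the L2 norm is assumed for TransE.

```python
import math

def deemb(params, t):
    """Diachronic vector for one entity; params = (a, w, b), with len(w) temporal features."""
    a, w, b = params
    k = len(w)
    return [a[n] * math.sin(w[n] * t + b[n]) for n in range(k)] + list(a[k:])

def distmult_score(z_v, z_r, z_u):
    return sum(v * r * u for v, r, u in zip(z_v, z_r, z_u))

def transe_score(z_v, z_r, z_u):
    return -math.sqrt(sum((v + r - u) ** 2 for v, r, u in zip(z_v, z_r, z_u)))

def make_temporal(static_score, entity_params):
    """Wrap a static score function so it scores (v, r, u, t) tuples."""
    def temporal_score(v, z_r, u, t):
        # Relations stay static; only the entity vectors depend on t.
        return static_score(deemb(entity_params[v], t), z_r, deemb(entity_params[u], t))
    return temporal_score
```

For example, `make_temporal(distmult_score, params)` gives a DE-DistMult-style scorer and `make_temporal(transe_score, params)` a DE-TransE-style scorer, without changing either static score function.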
Learning: The facts in a KG $\mathcal{G}$ are split into $train$, $validation$, and $test$ sets. Model parameters are learned using stochastic gradient descent with mini-batches. Let $B\subset train$ be a mini-batch. For each fact $f=(\mathsf{v},\mathsf{r},\mathsf{u},\mathsf{t})\in B$, we generate two queries: 1- $(\mathsf{v},\mathsf{r},?,\mathsf{t})$ and 2- $(?,\mathsf{r},\mathsf{u},\mathsf{t})$. For the first query, we generate a candidate answer set $C_{f,u}$ which contains $\mathsf{u}$ and $n$ (hereafter referred to as negative ratio) other entities selected randomly from $\mathcal{V}$. For the second query, we generate a similar candidate answer set $C_{f,v}$. Then we minimize the cross entropy loss, which has been used with good results for both static and temporal KG completion (see, e.g., [22, 16]):
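$$\mathcal{L}=-\sum_{f=(\mathsf{v},\mathsf{r},\mathsf{u},\mathsf{t})\in B}\left(\log\frac{\exp(\phi(\mathsf{v},\mathsf{r},\mathsf{u},\mathsf{t}))}{\sum_{\mathsf{u}'\in C_{f,u}}\exp(\phi(\mathsf{v},\mathsf{r},\mathsf{u}',\mathsf{t}))}+\log\frac{\exp(\phi(\mathsf{v},\mathsf{r},\mathsf{u},\mathsf{t}))}{\sum_{\mathsf{v}'\in C_{f,v}}\exp(\phi(\mathsf{v}',\mathsf{r},\mathsf{u},\mathsf{t}))}\right)$$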
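The training objective for a single fact can be sketched as follows. This is not the authors' PyTorch code: the helper names are hypothetical, negatives are drawn uniformly with replacement (so a negative may occasionally coincide with the true entity), and the true answer is placed first in each candidate list by convention.

```python
import math
import random

def sample_candidates(true_entity, entities, negative_ratio, rng):
    """Candidate answer set: the true entity followed by randomly drawn entities."""
    negatives = rng.choices(entities, k=negative_ratio)
    return [true_entity] + negatives

def query_loss(scores):
    """Cross entropy for one query; scores[0] belongs to the true answer."""
    peak = max(scores)  # subtract the maximum for numerical stability
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return log_norm - scores[0]

def fact_loss(tail_query_scores, head_query_scores):
    """Loss of one fact: the tail query (v, r, ?, t) plus the head query (?, r, u, t)."""
    return query_loss(tail_query_scores) + query_loss(head_query_scores)
```

Summing `fact_loss` over the facts of a mini-batch gives the quantity that is minimized; raising the score of the true entity relative to the sampled ones lowers the loss.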
4.1 Expressivity
Expressivity is an important property and has been the subject of study in several recent works on static (knowledge) graphs [6, 59, 25, 63, 2, 15]. If a model is not expressive enough, it is doomed to underfit in some applications. A desired property of a model is full expressiveness:
**Definition 3.**
A model with parameters $\theta$ is fully expressive if, given any world with true tuples $\mathcal{W}$ and false tuples $\mathcal{W}^{c}$, there exists an assignment for $\theta$ that correctly classifies the tuples in $\mathcal{W}$ and $\mathcal{W}^{c}$.
For static KG completion, several models have been proved to be fully expressive. For TKGC, however, no proof of full expressiveness yet exists for the previously proposed models. The following theorem establishes the full expressiveness of DE-SimplE. The proof can be found in Appendix A.
**Theorem 1 (Expressivity).**
DE-SimplE is fully expressive for temporal knowledge graph completion.
4.2 Domain Knowledge
For several static KG embedding models, it has been shown how certain types of domain knowledge (if it exists) can be incorporated into the embeddings through parameter sharing (aka tying) and how it helps improve model performance (see, e.g., [25, 55, 42, 15]). Incorporating domain knowledge for these static models can be ported to their temporal versions when they are extended to temporal KGs through our diachronic embeddings. As a proof of concept, we show how incorporating domain knowledge into SimplE can be ported to DE-SimplE. We chose SimplE for our proof of concept as several types of domain knowledge can be incorporated into it.
Consider $\mathsf{r}_{i}\in\mathcal{R}$ with $\mathtt{REMB}(\mathsf{r}_{i})=(\overrightarrow{\boldsymbol{z}}_{\mathsf{r}_{i}},\overleftarrow{\boldsymbol{z}}_{\mathsf{r}_{i}})$ (according to SimplE). If $\mathsf{r}_{i}$ is known to be symmetric or anti-symmetric, this knowledge can be incorporated into the embeddings by tying $\overrightarrow{\boldsymbol{z}}_{\mathsf{r}_{i}}$ to $\overleftarrow{\boldsymbol{z}}_{\mathsf{r}_{i}}$ or to the negation of $\overleftarrow{\boldsymbol{z}}_{\mathsf{r}_{i}}$ respectively [25]. If $\mathsf{r}_{i}$ is known to be the inverse of $\mathsf{r}_{j}$, this knowledge can be incorporated into the embeddings by tying $\overrightarrow{\boldsymbol{z}}_{\mathsf{r}_{i}}$ to $\overleftarrow{\boldsymbol{z}}_{\mathsf{r}_{j}}$ and $\overrightarrow{\boldsymbol{z}}_{\mathsf{r}_{j}}$ to $\overleftarrow{\boldsymbol{z}}_{\mathsf{r}_{i}}$ [25].
**Proposition 1.**
Symmetry, anti-symmetry, and inversion can be incorporated into DE-SimplE in the same way as SimplE.
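The effect of these tying rules can be checked numerically on the SimplE score. The sketch below is illustrative only: the helper names are hypothetical, and each entity is passed as a (head-role vector, tail-role vector) pair evaluated at one fixed timestamp, which is the only place where time would enter a DE-SimplE score.

```python
import math

def inner(*vectors):
    return sum(math.prod(column) for column in zip(*vectors))

def simple_score(v, u, r):
    """SimplE score; v, u, r are (forward, backward) pairs of vectors."""
    return 0.5 * (inner(v[0], r[0], u[1]) + inner(u[0], r[1], v[1]))

def tie_symmetric(r_fwd):
    """Symmetric relation: the backward vector is tied to the forward one."""
    return (r_fwd, list(r_fwd))

def tie_antisymmetric(r_fwd):
    """Anti-symmetric relation: the backward vector is the negated forward one."""
    return (r_fwd, [-x for x in r_fwd])

def tie_inverse(p, q):
    """Two relations that are inverses of each other share swapped vectors."""
    return (p, q), (q, p)
```

With tied vectors, swapping head and tail leaves the score unchanged for a symmetric relation and flips its sign for an anti-symmetric one. Since both entity vectors are taken at the same timestamp, the same argument carries over to time-dependent entity vectors.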
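If $\mathsf{r}_{i}$ is known to entail $\mathsf{r}_{j}$, Fatemi et al. [15] prove that if entity embeddings are constrained to be non-negative, then this knowledge can be incorporated by tying $\overrightarrow{\boldsymbol{z}}_{\mathsf{r}_{j}}$ to $\overrightarrow{\boldsymbol{z}}_{\mathsf{r}_{i}}+\overrightarrow{\boldsymbol{\delta}}_{\mathsf{r}_{j}}$ and $\overleftarrow{\boldsymbol{z}}_{\mathsf{r}_{j}}$ to $\overleftarrow{\boldsymbol{z}}_{\mathsf{r}_{i}}+\overleftarrow{\boldsymbol{\delta}}_{\mathsf{r}_{j}}$ where $\overrightarrow{\boldsymbol{\delta}}_{\mathsf{r}_{j}}$ and $\overleftarrow{\boldsymbol{\delta}}_{\mathsf{r}_{j}}$ are vectors with non-negative elements. We give a similar result for DE-SimplE.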
**Proposition 2.**
By constraining the $\boldsymbol{a}_{\mathsf{v}}$s in Equation (1) to be non-negative for all $\mathsf{v}\in\mathcal{V}$ and $\sigma$ to be an activation function with a non-negative range (such as ReLU, sigmoid, or squared exponential), entailment can be incorporated into DE-SimplE in the same way as SimplE.
Compared to the result in Fatemi et al. [15], the only added constraint for DE-SimplE is that the activation function in Equation (1) is also constrained to have a non-negative range. Proofs for Propositions 1 and 2 can be found in Appendix A.
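The entailment construction can also be checked numerically. The sketch below assumes non-negative amplitudes with a sigmoid activation (so every entity feature is non-negative) and builds the entailed relation by adding non-negative offsets to the entailing relation's vectors; all function and variable names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def nonneg_embedding(a, w, b, t):
    """Diachronic vector with a non-negative activation; a must be non-negative."""
    k = len(w)
    return [a[n] * sigmoid(w[n] * t + b[n]) for n in range(k)] + list(a[k:])

def inner(*vectors):
    return sum(math.prod(column) for column in zip(*vectors))

def simple_score(v, u, r):
    """SimplE score; v, u, r are (forward, backward) pairs of vectors."""
    return 0.5 * (inner(v[0], r[0], u[1]) + inner(u[0], r[1], v[1]))

def entailed_relation(r_i, offsets):
    """Vectors of a relation entailed by r_i: r_i plus non-negative offsets."""
    return tuple([x + d for x, d in zip(vec, off)] for vec, off in zip(r_i, offsets))
```

Under these constraints the score of the entailed relation is never below the score of the entailing relation at any timestamp, so whenever the first tuple is classified as true the second one is too.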
5 Experiments & Results
Datasets: Our datasets are subsets of two temporal KGs that have become standard benchmarks for TKGC: ICEWS [5] and GDELT [38]. For ICEWS, we use the two subsets generated by García-Durán et al. [16]: 1- ICEWS14 corresponding to the facts in 2014 and 2- ICEWS05-15 corresponding to the facts from 2005 to 2015. For GDELT, we use the subset extracted by Trivedi et al. [56] corresponding to the facts from April 1, 2015 to March 31, 2016. We changed the train/validation/test sets following a procedure similar to the one in [4] to turn the problem into a TKGC problem rather than an extrapolation problem. Table 1 provides a summary of the dataset statistics.
Baselines: Our baselines include both static and temporal KG embedding models. From the static KG embedding models, we use TransE, DistMult, and SimplE, where the timing information is ignored. From the temporal KG embedding models, we use the ones introduced in Section 3.
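Metrics: For each fact $f=(\mathsf{v},\mathsf{r},\mathsf{u},\mathsf{t})\in test$, we create two queries: 1- $(\mathsf{v},\mathsf{r},?,\mathsf{t})$ and 2- $(?,\mathsf{r},\mathsf{u},\mathsf{t})$. For the first query, the model ranks all entities in $\{\mathsf{u}\}\cup\overline{C_{f,u}}$ where $\overline{C_{f,u}}=\{\mathsf{u}':\mathsf{u}'\in\mathcal{V},(\mathsf{v},\mathsf{r},\mathsf{u}',\mathsf{t})\notin\mathcal{G}\}$. This corresponds to the filtered setting commonly used in the literature [4]. We follow a similar approach for the second query. Let $k_{f,u}$ and $k_{f,v}$ represent the ranking for $\mathsf{u}$ and $\mathsf{v}$ for the two queries respectively. We report mean reciprocal rank (MRR) defined as $\frac{1}{2|test|}\sum_{f\in test}\left(\frac{1}{k_{f,u}}+\frac{1}{k_{f,v}}\right)$. Compared to its counterpart mean rank, which is largely influenced by a single bad prediction, MRR is more stable [46]. We also report Hit@1, Hit@3 and Hit@10 measures, where Hit@k is defined as $\frac{1}{2|test|}\sum_{f\in test}\left(\mathbb{1}_{k_{f,u}\le k}+\mathbb{1}_{k_{f,v}\le k}\right)$, where $\mathbb{1}_{cond}$ is $1$ if $cond$ holds and $0$ otherwise.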
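The filtered ranking and the two metrics can be sketched as below. The helper names are hypothetical, scores are kept in a plain dictionary from entity to score, and ties are broken optimistically (only strictly higher scores push the true entity down), which is an assumption rather than something stated in the text.

```python
def filtered_rank(scores, target, known_true):
    """Rank of `target` among candidates, skipping other entities known to be correct.

    scores: dict mapping each entity to the model score for the query
    known_true: entities that also answer the query in the KG (filtered out)
    """
    target_score = scores[target]
    better = sum(
        1
        for entity, score in scores.items()
        if entity != target and entity not in known_true and score > target_score
    )
    return better + 1

def mean_reciprocal_rank(ranks):
    """ranks holds both the head and the tail rank of every test fact."""
    return sum(1.0 / k for k in ranks) / len(ranks)

def hit_at(ranks, k):
    """Fraction of queries whose true answer is ranked within the top k."""
    return sum(1 for rank in ranks if rank <= k) / len(ranks)
```

Passing one list that contains two ranks per test fact reproduces the normalization by twice the test-set size used in the definitions above.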
Implementation (code and datasets are available at https://github.com/BorealisAI/DE-SimplE): We implemented our model and the baselines in PyTorch [49]. We ran our experiments on a node with four GPUs. For the two ICEWS datasets, we report the results for some of the baselines from [16]. For the other experiments on these datasets, for the fairness of results, we follow a similar experimental setup as in [16] by using the ADAM optimizer [30] and setting learning rate , batch size , negative ratio , embedding size , and validating every epochs selecting the model giving the best validation MRR. Following the best results obtained in [41] (and considering the memory restrictions), for ConT we set embedding size , batch size on ICEWS14 and GDELT and on ICEWS05-15. We validated dropout values from . We tuned $\gamma$ for our model from the values . For GDELT, we used a similar setting but with a negative ratio due to the large size of the dataset. Unless stated otherwise, we use sine as the activation function for Equation (1). Since the timestamps in our datasets are dates rather than single numbers, we apply the temporal part of Equation (1) to year, month, and day separately (with different parameters), thus obtaining three temporal vectors. Then we take an element-wise sum of the resulting vectors, obtaining a single temporal vector. Intuitively, this can be viewed as converting a date into a timestamp in the embedded space.
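The handling of dates described above can be sketched as follows: the temporal part of the embedding is computed separately for the year, month, and day, each with its own parameters, and the three vectors are summed element-wise. The function name and the `params` layout are hypothetical, and sine is used as the activation.

```python
import math
from datetime import date

def date_temporal_vector(params, when, activation=math.sin):
    """Temporal features of one entity for a calendar date.

    params maps "year", "month", and "day" to a triple (a, w, b) of
    equal-length lists: amplitudes, frequencies, and biases.
    """
    components = {"year": when.year, "month": when.month, "day": when.day}
    size = len(params["year"][0])
    total = [0.0] * size
    for name, value in components.items():
        a, w, b = params[name]
        for n in range(size):
            # One temporal vector per date component, accumulated element-wise.
            total[n] += a[n] * activation(w[n] * value + b[n])
    return total

# Usage: two temporal features per component, evaluated for two nearby dates.
example_params = {name: ([1.0, 2.0], [0.1, 0.2], [0.0, 0.0]) for name in ("year", "month", "day")}
vec_a = date_temporal_vector(example_params, date(2014, 1, 5))
vec_b = date_temporal_vector(example_params, date(2014, 1, 6))
```

The static features would be concatenated to this vector exactly as in the single-number case.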
5.1 Comparative Study
We compare the baselines with three variants of our model: 1- DE-TransE, 2- DE-DistMult, and 3- DE-SimplE. The obtained results in Table 2 indicate that the large number of parameters per timestamp makes ConT perform poorly on ICEWS14 and ICEWS05-15. On GDELT, it shows a somewhat better performance as GDELT has many training facts in each timestamp. Besides affecting the predictive performance, the large number of parameters makes training ConT extremely slow. According to the results, the temporal versions of different models outperform the static counterparts in most cases, thus providing evidence for the merit of capturing temporal information.
DE-TransE outperforms the other TransE-based baselines (TTransE and HyTE) on ICEWS14 and GDELT and gives on-par results with HyTE on ICEWS05-15. This result shows the superiority of our diachronic embeddings compared to TTransE and HyTE. DE-DistMult outperforms TA-DistMult, the only DistMult-based baseline, showing the superiority of our diachronic embedding compared to TA-DistMult. Moreover, DE-DistMult outperforms all TransE-based baselines. Finally, just as SimplE beats TransE and DistMult due to its higher expressivity, our results show that DE-SimplE beats DE-TransE, DE-DistMult, and the other baselines due to its higher expressivity.
Previously, each of the existing models was tested on different subsets of ICEWS and GDELT and a comprehensive comparison of them did not exist. As a side contribution, Table 2 provides a comparison of these approaches on the same benchmarks and under the same experimental setting. The results reported in Table 2 may be directly used for comparison in future works.
5.2 Model Variants & Ablation Study
We run experiments on ICEWS14 with several variants of the proposed models to provide a better understanding of them. The results can be found in Table 3 and Figure 1. Table 3 includes DE-TransE and DE-DistMult with no variants as well so other variants can be easily compared to them.
Activation Function: So far, we used sine as the activation function in Equation (1). The performance for other activation functions, including Tanh, sigmoid, Leaky ReLU (with leakage), and squared exponential, is presented in Table 3. The table shows that other activation functions also perform well. Specifically, squared exponential performs almost on par with sine. We believe one reason why sine and squared exponential give better performance is that a combination of sine or squared exponential features can generate more sophisticated features than a combination of Tanh, sigmoid, or ReLU features. While a temporal feature with Tanh or sigmoid as the activation corresponds to a smooth off-on (or on-off) temporal switch, a temporal feature with sine or squared exponential activation corresponds to two (or more) switches (e.g., off-on-off), which can potentially model relations that start at some time and end after a while. These results also provide evidence for the effectiveness of diachronic embedding across several functions.
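The difference between one switch and several switches can be made concrete by thresholding a single temporal feature over a range of timestamps and counting how often it changes state. The threshold of 0.5, the parameter values, and the function names below are illustrative choices, not taken from the paper.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def squared_exponential(x):
    return math.exp(-x * x)

def count_switches(activation, w, b, times, threshold=0.5):
    """Number of on/off changes of the feature activation(w * t + b) over `times`."""
    states = [activation(w * t + b) > threshold for t in times]
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)

timestamps = range(11)
sine_switches = count_switches(math.sin, math.pi / 10, 0.0, timestamps)       # off-on-off
bump_switches = count_switches(squared_exponential, 0.5, -2.5, timestamps)    # off-on-off
sigmoid_switches = count_switches(sigmoid, 1.0, -5.0, timestamps)             # off-on
tanh_switches = count_switches(math.tanh, 1.0, -5.0, timestamps)              # off-on
```

On this grid the sine and squared exponential features switch twice, while the monotonic sigmoid and Tanh features can switch only once.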
Adding Diachronic Embedding for Relations: Compared to entities, we hypothesize that relations may evolve at a much lower rate or, for some relations, evolve only negligibly. Therefore, modeling relations with a static (rather than a diachronic) representation may suffice. To test this hypothesis, we ran DE-TransE and DE-DistMult on ICEWS14 where relation embeddings are also a function of time. From the obtained results in Table 3, one can see that the model with diachronic embeddings for both entities and relations performs on par with the model with diachronic embedding only for entities. We conducted the same experiment on ICEWS05-15 (which has a longer time horizon) and GDELT and observed similar results. These results show that, at least on our benchmarks, modeling the evolution of relations may not be helpful. Future work can test this hypothesis on datasets with other types of relations and longer horizons.
Generalizing to Unseen Timestamps: To measure how well our models generalize to timestamps not observed in the train set, we created a variant of the ICEWS14 dataset by including in the train set every fact except those on three fixed days of each month. We split the excluded facts randomly into validation and test sets (removing the ones including entities not observed in the train set). This ensures that none of the timestamps in the validation or test sets has been observed by the model in the train set. Then we ran DistMult and DE-DistMult on the resulting dataset. The obtained results in Table 3 indicate that DE-DistMult improves MRR from 0.410 to 0.452 over DistMult (a relative gain of roughly 10%), thus showing the effectiveness of our diachronic embedding in generalizing to unseen timestamps.
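A split of this kind can be sketched as follows. The function name, the fact layout `(head, relation, tail, date)`, and the particular held-out days used in the usage example are assumptions for illustration; the even validation/test split is also an assumption.

```python
import random
from datetime import date

def split_by_held_out_days(facts, held_out_days, seed=0):
    """Hold out all facts falling on the given days of the month.

    facts: list of (head, relation, tail, date) tuples
    Returns (train, validation, test); held-out facts mentioning entities
    that never occur in train are dropped.
    """
    train = [f for f in facts if f[3].day not in held_out_days]
    seen = {entity for v, _, u, _ in train for entity in (v, u)}
    held_out = [
        f for f in facts
        if f[3].day in held_out_days and f[0] in seen and f[2] in seen
    ]
    random.Random(seed).shuffle(held_out)
    half = len(held_out) // 2
    return train, held_out[:half], held_out[half:]

# Usage with toy facts; the held-out days {3, 7} are arbitrary example values.
toy_facts = [("a", "r", "b", date(2014, 1, d)) for d in range(1, 11)]
toy_facts.append(("new", "r", "b", date(2014, 2, 3)))
train, valid, test = split_by_held_out_days(toy_facts, {3, 7})
```

Because entire days are removed from training, every date in the validation and test sets is unseen, which is exactly the situation a per-timestamp embedding table cannot handle.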
Importance of Model Parameters Used in Equation 1: In Equation (1), the temporal part of the embedding contains three components: $\boldsymbol{a}_{\mathsf{v}}$, $\boldsymbol{w}_{\mathsf{v}}$, and $\boldsymbol{b}_{\mathsf{v}}$. To measure the importance of each component, we ran DE-DistMult on ICEWS14 under three settings: 1- when the $\boldsymbol{a}_{\mathsf{v}}$s are removed (i.e. set to $1$), 2- when the $\boldsymbol{w}_{\mathsf{v}}$s are removed (i.e. set to $1$), and 3- when the $\boldsymbol{b}_{\mathsf{v}}$s are removed (i.e. set to $0$). The results presented in Table 3 show that all three components are important for the temporal features, especially the $\boldsymbol{a}_{\mathsf{v}}$s and $\boldsymbol{w}_{\mathsf{v}}$s. Removing the $\boldsymbol{b}_{\mathsf{v}}$s does not affect the results as much as removing the $\boldsymbol{a}_{\mathsf{v}}$s or $\boldsymbol{w}_{\mathsf{v}}$s. Therefore, if one needs to reduce the number of parameters, removing $\boldsymbol{b}_{\mathsf{v}}$ may be a good option as long as a slight reduction in accuracy can be tolerated.
Static Features: Figure 1(a) shows the test MRR of DE-SimplE on ICEWS14 as a function of $\gamma$, the percentage of temporal features. According to Figure 1(a), as soon as some features become temporal (i.e. $\gamma$ changes from $0$ to a non-zero number), a substantial boost in performance can be observed. This observation sheds more light on the importance of learning temporal features and having diachronic embeddings. As $\gamma$ becomes larger, MRR reaches a peak and then slightly drops. This slight drop in performance can be due to overfitting to temporal cues. This result demonstrates that modeling static features explicitly can help reduce the number of learnable parameters and avoid overfitting. Such a design choice may be even more important when the embedding dimensions are larger. However, it comes at the cost of adding one hyper-parameter to the model. If one prefers a slightly less accurate model with fewer hyper-parameters, they can make all vector elements temporal.
Training Curve: Figure 1(b) shows the training curve for DistMult and DE-DistMult on ICEWS14. While it has been argued that using sine activation functions may complicate training in some neural network architectures (see, e.g., [37, 17]), the figure shows that when using sine activations, the training curve for our model is quite stable.
6 Related Work
StaRAI: Statistical relational AI (StaRAI) [50, 31] approaches are mainly based on soft (hand-crafted or learned) rules [51, 11, 29, 26] where the probability of a world is typically proportional to the number of rules that are satisfied/violated in that world and the confidence for each rule. A line of work in this area combines a stack of soft rules with embeddings for property prediction [54, 24]. Another line of work extends the soft rules to temporal KGs [52, 48, 14, 20, 9, 8]. The approaches based on soft rules have generally been shown to perform worse than KG embedding models [46].
Graph Walk: These approaches define weighted template walks on a KG and then answer queries by template matching [35, 36]. They have been shown to be quite similar to, and in some cases subsumed by, the models based on soft rules [23].
Static KG Embedding: A large number of models have been developed for static KG embedding. A class of these models are the translational approaches corresponding to variations of TransE (see, e.g., [39, 61, 43]). Another class of approaches are based on a bilinear score function each imposing a different sparsity constraint on the matrices (see, e.g., [45, 58, 47, 25, 40]). A third class of models are based on deep learning approaches using feed-forward or convolutional layers on top of the embeddings (see, e.g., [53, 13, 12, 1]). These models can be potentially extended to TKGC through our diachronic embedding.
Temporal KG Embedding: Several works have extended the static KG embedding models to temporal KGs. Jiang et al. [21] extend TransE by adding atimestamp embedding into the score function. Dasgupta et al. [10] extend TransE by projecting the embeddings to the timestamp hyperplain and then using the TransE score on the projected space. Ma et al. [41] extend several models by adding a timestamp embedding to their score functions. These models may not work well when the number of timestamps is large. Furthermore, since they only learn embeddings for observed timestamps, they cannot generalize to unseen timestamps. García-Durán et al. [16] extend TransE and DistMult by combining the relation and timestamp through a character LSTM. These models have been described in detail in Section 2 and their performances have been reported in Table 2.
KG Embedding for Extrapolation: TKGC is an interpolation problem where, given a set of temporal facts in a time frame, the goal is to predict the missing facts. A related problem is the extrapolation problem where future interactions are to be predicted (see, e.g., [56, 33, 57]). Despite some similarities in the employed approaches, KG extrapolation is fundamentally different from TKGC in that a score for an interaction is to be computed given only the past (i.e., facts before the time of that interaction) whereas in TKGC the score is to be computed given past, present, and future. A comprehensive analysis of the existing models for interpolation and extrapolation can be found in [27].
Diachronic Word Embeddings: The idea behind our proposed embeddings is similar to diachronic word embeddings where a corpus is typically broken temporally into slices (e.g., 20-year chunks of a 200-year corpus) and embeddings are learned for words in each chunk, thus providing word embeddings that are a function of time (see, e.g., [28, 32, 18, 3]). The main goal of diachronic word embeddings is to reveal how the meanings of the words have evolved over time. Our work can be viewed as an extension of diachronic word embeddings to continuous-time KG completion.
7 Conclusion
Temporal knowledge graph (KG) completion is an important problem and has been the focus of several recent studies. We developed a diachronic embedding function for temporal KG completion which provides a hidden representation for the entities of a temporal KG at any point in time. Our embedding is generic and can be combined with any score function. We proved that combining our diachronic embedding with SimplE results in a fully expressive model – the first temporal KG embedding model for which such a result exists. We showed the superior performance of our model compared to existing work on several benchmarks. Future work includes designing functions other than the one proposed in Equation 1, a comprehensive study of which functions are favored by different types of KGs, and using our proposed embedding for diachronic word embedding.
Appendix A Proof of Theorems and Propositions
Theorem 1.
DE-SimplE is fully expressive for temporal knowledge graph completion.
Proof.
For every entity $\mathsf{v}_{i}$, let $\mathtt{DEEMB}(\mathsf{v}_{i},t)=(\vec{\boldsymbol{z}}^{\mathsf{t}}_{\mathsf{v}_{i}},\overleftarrow{\boldsymbol{z}}^{\mathsf{t}}_{\mathsf{v}_{i}})$ where, according to Equation 1 with sine activations, $\vec{\boldsymbol{z}}^{\mathsf{t}}_{\mathsf{v}_{i}}\in\mathbb{R}^{d}$ and $\overleftarrow{\boldsymbol{z}}^{\mathsf{t}}_{\mathsf{v}_{i}}\in\mathbb{R}^{d}$ are defined as follows:
[TABLE]
and:
[TABLE]
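We provide the proof for a specific case of DE-SimplE where the elements of the $\vec{\boldsymbol{z}}^{\mathsf{t}}_{\mathsf{v}}$ vectors are all temporal and the elements of the $\overleftarrow{\boldsymbol{z}}^{\mathsf{t}}_{\mathsf{v}}$ vectors are all non-temporal. This specific case can be achieved by setting $\gamma=1$, and $\overleftarrow{\boldsymbol{w}}_{\mathsf{v}}[n]=0$ and $\overleftarrow{\boldsymbol{b}}_{\mathsf{v}}[n]=\frac{\pi}{2}$ for every entity $\mathsf{v}$ and for all $1\leq n\leq d$, so that each sine term of $\overleftarrow{\boldsymbol{z}}^{\mathsf{t}}_{\mathsf{v}}$ equals $\sin(\frac{\pi}{2})=1$ regardless of $t$. If this specific case of DE-SimplE is fully expressive, so is DE-SimplE. In this specific case, $\vec{\boldsymbol{z}}^{\mathsf{t}}_{\mathsf{v}_{i}}$ and $\overleftarrow{\boldsymbol{z}}^{\mathsf{t}}_{\mathsf{v}_{i}}$ for every $\mathsf{v}_{i}$ can be re-written as follows: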
[TABLE]
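The special case above can be checked numerically. The following is a minimal sketch, assuming Equation 1 with sine activations takes the form $\boldsymbol{a}[n]\sin(\boldsymbol{w}[n]\,t+\boldsymbol{b}[n])$ for the first $\gamma d$ elements and $\boldsymbol{a}[n]$ for the rest; the function name `diachronic_embedding` and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np

def diachronic_embedding(a, w, b, t, gamma):
    """Assumed form of Equation 1 with sine activations.

    The first gamma*d elements are temporal, a[n] * sin(w[n] * t + b[n]);
    the remaining elements are the static values a[n].
    """
    d = a.shape[0]
    n_temporal = int(round(gamma * d))
    z = a.copy()
    z[:n_temporal] = a[:n_temporal] * np.sin(w[:n_temporal] * t + b[:n_temporal])
    return z

rng = np.random.default_rng(0)
d = 6
a = rng.normal(size=d)

# General case with gamma = 1: every element changes with t.
w = rng.normal(size=d)
b = rng.normal(size=d)
forward_t1 = diachronic_embedding(a, w, b, t=1.0, gamma=1.0)
forward_t2 = diachronic_embedding(a, w, b, t=2.0, gamma=1.0)

# Special case used in the proof: w = 0 and b = pi/2 give sin(pi/2) = 1,
# so the embedding equals a at every time even though gamma = 1.
w0 = np.zeros(d)
b0 = np.full(d, np.pi / 2)
backward_t1 = diachronic_embedding(a, w0, b0, t=1.0, gamma=1.0)
backward_t2 = diachronic_embedding(a, w0, b0, t=2.0, gamma=1.0)
```

Under these assumptions, `backward_t1` and `backward_t2` both coincide with `a`, while `forward_t1` and `forward_t2` differ.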
For every relation $\mathsf{r}_{j}$, let $\mathtt{REMB}(\mathsf{r}_{j})=(\vec{\boldsymbol{z}}_{\mathsf{r}_{j}},\overleftarrow{\boldsymbol{z}}_{\mathsf{r}_{j}})$. To further simplify the proof, following [25], we only show how the embedding values can be set such that $\langle\vec{\boldsymbol{z}}^{\mathsf{t}}_{\mathsf{v}_{i}},\vec{\boldsymbol{z}}_{\mathsf{r}_{j}},\overleftarrow{\boldsymbol{z}}^{\mathsf{t}}_{\mathsf{v}_{k}}\rangle$ becomes a positive number if the fact $(\mathsf{v}_{i},\mathsf{r}_{j},\mathsf{v}_{k},t)$ is true and a negative number if it is false. Extending the proof to the case where the score contains both components ($\langle\vec{\boldsymbol{z}}^{\mathsf{t}}_{\mathsf{v}_{i}},\vec{\boldsymbol{z}}_{\mathsf{r}_{j}},\overleftarrow{\boldsymbol{z}}^{\mathsf{t}}_{\mathsf{v}_{k}}\rangle$ and $\langle\vec{\boldsymbol{z}}^{\mathsf{t}}_{\mathsf{v}_{k}},\overleftarrow{\boldsymbol{z}}_{\mathsf{r}_{j}},\overleftarrow{\boldsymbol{z}}^{\mathsf{t}}_{\mathsf{v}_{i}}\rangle$) can be done by doubling the size of the embedding vectors and following a similar procedure as the one explained below for the second half of the vectors.
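Assume $d=|\mathcal{R}|\cdot|\mathcal{V}|\cdot|\mathcal{T}|\cdot L$, where $|\mathcal{R}|$, $|\mathcal{V}|$, and $|\mathcal{T}|$ are the numbers of relations, entities, and timestamps, and $L$ is a natural number. These vectors can be viewed as $|\mathcal{R}|$ blocks of size $|\mathcal{V}|\cdot|\mathcal{T}|\cdot L$. For the $j$-th relation $\mathsf{r}_{j}$, let $\vec{\boldsymbol{z}}_{\mathsf{r}_{j}}$ be zero everywhere except on the $j$-th block where it is $1$ everywhere. With such a value assignment to the $\vec{\boldsymbol{z}}_{\mathsf{r}_{j}}$s, to find the score for a fact $(\mathsf{v}_{i},\mathsf{r}_{j},\mathsf{v}_{k},t)$, only the $j$-th block of each embedding vector is important. Let us now focus on the $j$-th block.
The size of the $j$-th block (similar to all other blocks) is $|\mathcal{V}|\cdot|\mathcal{T}|\cdot L$ and it can be viewed as $|\mathcal{V}|$ sub-blocks of size $|\mathcal{T}|\cdot L$. For the $i$-th entity $\mathsf{v}_{i}$, let the values of $\vec{\boldsymbol{z}}^{\mathsf{t}}_{\mathsf{v}_{i}}$ be zero in all sub-blocks except the $i$-th sub-block. With such a value assignment, to find the score for a fact $(\mathsf{v}_{i},\mathsf{r}_{j},\mathsf{v}_{k},t)$, only the $i$-th sub-block of the $j$-th block is important. Note that this sub-block is unique for each tuple $(\mathsf{v}_{i},\mathsf{r}_{j})$. Let us now focus on the $i$-th sub-block of the $j$-th block.
The size of the $i$-th sub-block of the $j$-th block is $|\mathcal{T}|\cdot L$ and it can be viewed as $|\mathcal{T}|$ sub-sub-blocks of size $L$. According to the Fourier sine series [7], with a large enough $L$, we can set the values for $\vec{\boldsymbol{a}}_{\mathsf{v}_{i}}$, $\vec{\boldsymbol{w}}_{\mathsf{v}_{i}}$, and $\vec{\boldsymbol{b}}_{\mathsf{v}_{i}}$ in a way that the sum of the elements of $\vec{\boldsymbol{z}}^{\mathsf{t}}_{\mathsf{v}_{i}}$ for the $p$-th sub-sub-block becomes $1$ when $t=t_{p}$ (where $t_{p}$ is the $p$-th timestamp in $\mathcal{T}$) and $0$ when $t$ is a timestamp other than $t_{p}$. Note that this sub-sub-block is unique for each tuple $(\mathsf{v}_{i},\mathsf{r}_{j},t_{p})$.
Having the above value assignments, if the fact $(\mathsf{v}_{i},\mathsf{r}_{j},\mathsf{v}_{k},t_{p})$ is true, we set all the values in the $p$-th sub-sub-block of the $i$-th sub-block of the $j$-th block of the non-temporal vector $\overleftarrow{\boldsymbol{z}}^{\mathsf{t}}_{\mathsf{v}_{k}}$ to $1$. With this assignment, $\langle\vec{\boldsymbol{z}}^{\mathsf{t}}_{\mathsf{v}_{i}},\vec{\boldsymbol{z}}_{\mathsf{r}_{j}},\overleftarrow{\boldsymbol{z}}^{\mathsf{t}}_{\mathsf{v}_{k}}\rangle=1$ at $t=t_{p}$. If the fact is false, we set all the values for the $p$-th sub-sub-block of the $i$-th sub-block of the $j$-th block of $\overleftarrow{\boldsymbol{z}}^{\mathsf{t}}_{\mathsf{v}_{k}}$ to $-1$. With this assignment, $\langle\vec{\boldsymbol{z}}^{\mathsf{t}}_{\mathsf{v}_{i}},\vec{\boldsymbol{z}}_{\mathsf{r}_{j}},\overleftarrow{\boldsymbol{z}}^{\mathsf{t}}_{\mathsf{v}_{k}}\rangle=-1$ at $t=t_{p}$. ∎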
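The key step of the proof is that a sum of sine terms can act as an indicator of a single timestamp. The sketch below illustrates one way to realize this. It is not necessarily the construction the proof has in mind: it assumes timestamps are the integers $0,\dots,T-1$, that each temporal element has the form $a\sin(wt+b)$, and it uses a discrete Fourier identity with $T$ sine terms. The names `indicator_sine_params` and `sub_sub_block_sum` are hypothetical.

```python
import numpy as np

def indicator_sine_params(num_timestamps, target):
    """Amplitudes, frequencies, and phases of sines that sum to 1 at `target`, 0 elsewhere.

    Uses the identity (1/T) * sum_k cos(2*pi*k*(t - target)/T) = [t == target]
    for integer t in {0, ..., T-1}, together with cos(x) = sin(x + pi/2).
    """
    T = num_timestamps
    k = np.arange(T)
    a = np.full(T, 1.0 / T)        # amplitudes
    w = 2.0 * np.pi * k / T        # frequencies
    b = np.pi / 2 - w * target     # phase shifts
    return a, w, b

def sub_sub_block_sum(a, w, b, t):
    # Sum of the temporal elements a[n] * sin(w[n] * t + b[n]) of one sub-sub-block.
    return float(np.sum(a * np.sin(w * t + b)))

T = 7        # number of timestamps (assumed integer-indexed)
target = 3   # the timestamp the sub-sub-block should respond to
a, w, b = indicator_sine_params(T, target)
sums = np.array([sub_sub_block_sum(a, w, b, t) for t in range(T)])
```

Here `sums` is (up to floating-point error) 1 at index `target` and 0 at every other timestamp, so multiplying the sub-sub-block by $+1$ or $-1$ on the tail side yields a score of $+1$ or $-1$ at that timestamp.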
Proposition 1.
Symmetry, anti-symmetry, and inversion can be incorporated into DE-SimplE in the same way as SimplE.
Proof.
Let $\mathsf{r}_{i}$ with $\mathtt{REMB}(\mathsf{r}_{i})=(\vec{\boldsymbol{z}}_{\mathsf{r}_{i}},\overleftarrow{\boldsymbol{z}}_{\mathsf{r}_{i}})$ be symmetric. According to DE-SimplE, for a fact $(\mathsf{v},\mathsf{r}_{i},\mathsf{u},\mathsf{t})$ we have:
[TABLE]
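where $\phi(\cdot)$ gives the DE-SimplE score for a fact, $\vec{\boldsymbol{z}}^{\mathsf{t}}_{\mathsf{v}}$ and $\overleftarrow{\boldsymbol{z}}^{\mathsf{t}}_{\mathsf{v}}$ are two vectors assigned to $\mathsf{v}$ (according to SimplE) both defined according to Equation 1, and $\vec{\boldsymbol{z}}^{\mathsf{t}}_{\mathsf{u}}$ and $\overleftarrow{\boldsymbol{z}}^{\mathsf{t}}_{\mathsf{u}}$ are two vectors assigned to $\mathsf{u}$ both defined according to Equation 1. Moreover, for a fact $(\mathsf{u},\mathsf{r}_{i},\mathsf{v},\mathsf{t})$ we have: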
[TABLE]
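By tying $\vec{\boldsymbol{z}}_{\mathsf{r}_{i}}$ to $\overleftarrow{\boldsymbol{z}}_{\mathsf{r}_{i}}$, the two scores become identical. Therefore, tying $\vec{\boldsymbol{z}}_{\mathsf{r}_{i}}$ to $\overleftarrow{\boldsymbol{z}}_{\mathsf{r}_{i}}$ ensures that the score for $(\mathsf{v},\mathsf{r}_{i},\mathsf{u},\mathsf{t})$ is the same as the score for $(\mathsf{u},\mathsf{r}_{i},\mathsf{v},\mathsf{t})$, thus ensuring the symmetry of $\mathsf{r}_{i}$. With the same argument, if $\vec{\boldsymbol{z}}_{\mathsf{r}_{i}}$ is tied to $-\overleftarrow{\boldsymbol{z}}_{\mathsf{r}_{i}}$, then one score becomes the negation of the other score, so only one of them can be true, which makes $\mathsf{r}_{i}$ anti-symmetric.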
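Assume $\mathsf{r}_{j}$ with $\mathtt{REMB}(\mathsf{r}_{j})=(\vec{\boldsymbol{z}}_{\mathsf{r}_{j}},\overleftarrow{\boldsymbol{z}}_{\mathsf{r}_{j}})$ is known to be the inverse of $\mathsf{r}_{i}$. Then the score for a fact $(\mathsf{v},\mathsf{r}_{i},\mathsf{u},\mathsf{t})$ is as in Equation (6) and for $(\mathsf{u},\mathsf{r}_{j},\mathsf{v},\mathsf{t})$ is as follows: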
[TABLE]
By tying $\vec{\boldsymbol{z}}_{\mathsf{r}_{j}}$ to $\overleftarrow{\boldsymbol{z}}_{\mathsf{r}_{i}}$ and $\overleftarrow{\boldsymbol{z}}_{\mathsf{r}_{j}}$ to $\vec{\boldsymbol{z}}_{\mathsf{r}_{i}}$, the score in Equation (8) can be re-written as:
[TABLE]
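This score is identical to the score in Equation (6). Therefore, tying $\vec{\boldsymbol{z}}_{\mathsf{r}_{j}}$ to $\overleftarrow{\boldsymbol{z}}_{\mathsf{r}_{i}}$ and $\overleftarrow{\boldsymbol{z}}_{\mathsf{r}_{j}}$ to $\vec{\boldsymbol{z}}_{\mathsf{r}_{i}}$ ensures $\mathsf{r}_{i}$ and $\mathsf{r}_{j}$ are the inverse of each other. ∎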
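The three tying schemes of Proposition 1 can be verified numerically. The sketch below assumes the SimplE-style score $\frac{1}{2}\big(\langle\vec{\boldsymbol{z}}^{\mathsf{t}}_{\mathsf{v}},\vec{\boldsymbol{z}}_{\mathsf{r}},\overleftarrow{\boldsymbol{z}}^{\mathsf{t}}_{\mathsf{u}}\rangle+\langle\vec{\boldsymbol{z}}^{\mathsf{t}}_{\mathsf{u}},\overleftarrow{\boldsymbol{z}}_{\mathsf{r}},\overleftarrow{\boldsymbol{z}}^{\mathsf{t}}_{\mathsf{v}}\rangle\big)$ and treats the entity vectors as already evaluated at a fixed time; the function names and the random values are hypothetical.

```python
import numpy as np

def triple_product(x, y, z):
    # <x, y, z>: sum of the element-wise product of three vectors.
    return float(np.sum(x * y * z))

def de_simple_score(head, rel, tail):
    """Assumed SimplE-style score; each argument is a (forward, backward) pair."""
    h_fwd, h_bwd = head
    r_fwd, r_bwd = rel
    t_fwd, t_bwd = tail
    return 0.5 * (triple_product(h_fwd, r_fwd, t_bwd)
                  + triple_product(t_fwd, r_bwd, h_bwd))

rng = np.random.default_rng(0)
d = 8
# Diachronic entity embeddings, taken as already evaluated at some fixed time t.
v = (rng.normal(size=d), rng.normal(size=d))
u = (rng.normal(size=d), rng.normal(size=d))
r = rng.normal(size=d)
q = rng.normal(size=d)

symmetric = (r, r)        # forward vector tied to backward vector
antisymmetric = (r, -r)   # forward vector tied to the negated backward vector
r_i = (r, q)
r_j = (q, r)              # inverse of r_i: forward and backward vectors swapped

sym_gap = de_simple_score(v, symmetric, u) - de_simple_score(u, symmetric, v)
anti_sum = de_simple_score(v, antisymmetric, u) + de_simple_score(u, antisymmetric, v)
inv_gap = de_simple_score(v, r_i, u) - de_simple_score(u, r_j, v)
```

All three quantities vanish up to floating-point error. Because the tying acts only on the relation vectors, the check does not depend on how the entity vectors vary with time.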
Proposition 2.
By constraining the amplitude vectors $\boldsymbol{a}_{\mathsf{v}}$ in Equation (1) to be non-negative for all entities $\mathsf{v}$ and $\sigma$ to be an activation function with a non-negative range (such as ReLU, sigmoid, or squared exponential), entailment can be incorporated into DE-SimplE in the same way as SimplE.
Proof.
Let $\mathsf{r}_{i}$ with $\mathtt{REMB}(\mathsf{r}_{i})=(\vec{\boldsymbol{z}}_{\mathsf{r}_{i}},\overleftarrow{\boldsymbol{z}}_{\mathsf{r}_{i}})$ and $\mathsf{r}_{j}$ with $\mathtt{REMB}(\mathsf{r}_{j})=(\vec{\boldsymbol{z}}_{\mathsf{r}_{j}},\overleftarrow{\boldsymbol{z}}_{\mathsf{r}_{j}})$ be two distinct relations such that $\mathsf{r}_{i}$ entails $\mathsf{r}_{j}$. For a fact $(\mathsf{v},\mathsf{r}_{i},\mathsf{u},\mathsf{t})$, the score according to DE-SimplE is as in Equation (6), and for $(\mathsf{v},\mathsf{r}_{j},\mathsf{u},\mathsf{t})$, the score is as follows:
[TABLE]
By tying $\vec{\boldsymbol{z}}_{\mathsf{r}_{j}}$ to $\vec{\boldsymbol{z}}_{\mathsf{r}_{i}}+\vec{\boldsymbol{\delta}}_{\mathsf{r}_{j}}$ and $\overleftarrow{\boldsymbol{z}}_{\mathsf{r}_{j}}$ to $\overleftarrow{\boldsymbol{z}}_{\mathsf{r}_{i}}+\overleftarrow{\boldsymbol{\delta}}_{\mathsf{r}_{j}}$, where $\vec{\boldsymbol{\delta}}_{\mathsf{r}_{j}}$ and $\overleftarrow{\boldsymbol{\delta}}_{\mathsf{r}_{j}}$ are vectors with non-negative elements (thus, making this tying scheme equivalent to two inequality constraints), the score in Equation (10) can be re-written as:
[TABLE]
The constraints imposed on the elements of $\vec{\boldsymbol{z}}^{\mathsf{t}}_{\mathsf{v}}$, $\overleftarrow{\boldsymbol{z}}^{\mathsf{t}}_{\mathsf{v}}$, $\vec{\boldsymbol{z}}^{\mathsf{t}}_{\mathsf{u}}$, and $\overleftarrow{\boldsymbol{z}}^{\mathsf{t}}_{\mathsf{u}}$ ensure that all elements of these vectors are non-negative. Furthermore, $\vec{\boldsymbol{\delta}}_{\mathsf{r}_{j}}$ and $\overleftarrow{\boldsymbol{\delta}}_{\mathsf{r}_{j}}$ have also been constrained to be non-negative. Therefore, $\langle\vec{\boldsymbol{z}}^{\mathsf{t}}_{\mathsf{v}},\vec{\boldsymbol{\delta}}_{\mathsf{r}_{j}},\overleftarrow{\boldsymbol{z}}^{\mathsf{t}}_{\mathsf{u}}\rangle$ and $\langle\vec{\boldsymbol{z}}^{\mathsf{t}}_{\mathsf{u}},\overleftarrow{\boldsymbol{\delta}}_{\mathsf{r}_{j}},\overleftarrow{\boldsymbol{z}}^{\mathsf{t}}_{\mathsf{v}}\rangle$ are both non-negative resulting in:
[TABLE]
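Since $\phi(\mathsf{v},\mathsf{r}_{j},\mathsf{u},\mathsf{t})\geq\phi(\mathsf{v},\mathsf{r}_{i},\mathsf{u},\mathsf{t})$, the probability of $(\mathsf{v},\mathsf{r}_{j},\mathsf{u},\mathsf{t})$ being true according to DE-SimplE is greater than or equal to the probability of $(\mathsf{v},\mathsf{r}_{i},\mathsf{u},\mathsf{t})$ being true, thus ensuring the entailment of the relations. ∎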
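The entailment argument can also be checked numerically. The sketch below assumes the same SimplE-style score as Proposition 1, a sigmoid activation, all-temporal entity embeddings of the form $\boldsymbol{a}\,\sigma(\boldsymbol{w}t+\boldsymbol{b})$ with non-negative amplitudes, and relation vectors of $\mathsf{r}_{j}$ tied to those of $\mathsf{r}_{i}$ plus non-negative offsets; all names and values are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def diachronic(a, w, b, t):
    # All-temporal embedding with a non-negative activation (assumed form of Equation 1).
    return a * sigmoid(w * t + b)

def score(head, rel, tail):
    # Assumed SimplE-style score; each argument is a (forward, backward) pair.
    return 0.5 * (float(np.sum(head[0] * rel[0] * tail[1]))
                  + float(np.sum(tail[0] * rel[1] * head[1])))

rng = np.random.default_rng(0)
d = 8

def random_entity_params():
    # Non-negative amplitudes; frequencies and phases are unconstrained.
    return rng.uniform(0, 1, size=d), rng.normal(size=d), rng.normal(size=d)

v_fwd, v_bwd, u_fwd, u_bwd = (random_entity_params() for _ in range(4))

r_i = (rng.normal(size=d), rng.normal(size=d))
delta = (rng.uniform(0, 1, size=d), rng.uniform(0, 1, size=d))  # non-negative offsets
r_j = (r_i[0] + delta[0], r_i[1] + delta[1])                    # r_i entails r_j

gaps = []
for t in [0.0, 1.5, 10.0]:
    v = (diachronic(*v_fwd, t), diachronic(*v_bwd, t))
    u = (diachronic(*u_fwd, t), diachronic(*u_bwd, t))
    gaps.append(score(v, r_j, u) - score(v, r_i, u))
```

Every entry of `gaps` is non-negative, so under these assumptions the score of $\mathsf{r}_{j}$ is never below the score of $\mathsf{r}_{i}$ at any of the sampled times.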
References
- [1] Ivana Balazevic, Carl Allen, and Timothy M Hospedales. Hypernetwork knowledge graph embeddings. arXiv preprint arXiv:1808.07018, 2018.
- [2] Ivana Balažević, Carl Allen, and Timothy M Hospedales. TuckER: Tensor factorization for knowledge graph completion. arXiv preprint arXiv:1901.09590, 2019.
- [3] Robert Bamler and Stephan Mandt. Dynamic word embeddings. In ICML, pages 380–389, 2017.
- [4] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. In NeurIPS, pages 2787–2795, 2013.
- [5] Elizabeth Boschee, Jennifer Lautenschlager, Sean O’Brien, Steve Shellman, James Starz, and Michael Ward. ICEWS coded event data. Harvard Dataverse, 12, 2015.
- [6] David Buchman and David Poole. Negation without negation in probabilistic logic programming. In KR, 2016.
- [7] Horatio Scott Carslaw. Introduction to the Theory of Fourier’s Series and Integrals. Macmillan, 1921.
- [8] Melisachew Wudage Chekol and Heiner Stuckenschmidt. Rule based temporal inference. In ICLP. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2018.
