Towards Time-Aware Distant Supervision for Relation Extraction

Tianwen Jiang; Sendong Zhao; Jing Liu; Jin-Ge Yao; Ming Liu; Bing Qin; Ting Liu; Chin-Yew Lin

arXiv:1903.03289·cs.CL·March 11, 2019

TL;DR

This paper introduces Time-DS, a time-aware framework for relation extraction that leverages timestamp information to reduce noise from distant supervision, significantly improving extraction accuracy on news data.

Contribution

The paper proposes a novel time-aware distant supervision framework incorporating instance-popularity and two strategies, hard filter and curriculum learning, to enhance relation extraction.

Findings

01. Time-DS outperforms baseline methods on the news corpus.

02. Instance-popularity effectively reduces noisy labels.

03. Curriculum learning improves relation extraction accuracy.

Tables

Table 1. Statistics of the supervision knowledge and annotated dataset. The statistics of the supervision knowledge are reported in the “#Relation Instances” columns. “Original Set” is the whole original training set, i.e., the training set produced by DS. The other filtered training sets (“$\geq 0.1$” to “$\geq 0.6$”) are produced by the hard filter with different instance-popularity thresholds.

Relation	#Relation Instances (Training)	#Relation Instances (Test)	#Training Sentences: Original Set ($\geq 0.0$)	$\geq 0.1$	$\geq 0.2$	$\geq 0.3$	$\geq 0.4$	$\geq 0.5$	$\geq 0.6$	#Test Sentences
Acquisition	30	8	39,365	9,905	8,361	6,963	6,388	5,825	4,664	694
Investing	46	11	2,741	1,227	239	146	142	141	138	48
JobChange	71	12	188,945	35,154	25,177	21,622	18,655	16,452	13,507	905
Lawsuit	15	3	16,503	2,147	1,588	1,334	944	854	697	313
Partnership	12	5	1,408	794	794	794	794	794	259	503
Total	174	39	248,872	49,224	36,156	30,859	26,923	24,066	19,265	2,463/376

Table 2. Macro-averaged precision, recall and F1 of the relation classification model trained on the Original Set and on each of the six filtered training sets.

Training Set	macro-P(%)	macro-R(%)	macro-F1(%)
The Basic DS
Original Set	$67.62 \pm 0.27$	$78.62 \pm 0.26$	$72.71 \pm 0.25$
Time-DS with Hard Filter
InsPo $\geq 0.1$	$70.72 \pm 0.39$	$77.04 \pm 0.20$	$73.74 \pm 0.27$
InsPo $\geq 0.2$	$72.73 \pm 0.33$	$75.96 \pm 0.28$	$74.31 \pm 0.27$
InsPo $\geq 0.3$	$76.38 \pm 0.23$	$77.65 \pm 0.27$	$77.01 \pm 0.20$
InsPo $\geq 0.4$	$71.74 \pm 0.29$	$74.30 \pm 0.25$	$73.00 \pm 0.22$
InsPo $\geq 0.5$	$72.35 \pm 0.39$	$75.27 \pm 0.33$	$73.78 \pm 0.31$
InsPo $\geq 0.6$	$74.27 \pm 0.37$	$77.76 \pm 0.32$	$75.97 \pm 0.31$

Table 3. Macro-averaged precision, recall and F1 of the relation classification model trained on the Original Set with curriculum learning.

Training Set	macro-P(%)	macro-R(%)	macro-F1(%)
The Basic DS
Original Set	$67.62 \pm 0.27$	$78.62 \pm 0.26$	$72.71 \pm 0.25$
Time-DS with Curriculum Learning
InsPo $\geq 0.0$ (7th round)	$69.09 \pm 0.39$	$77.93 \pm 0.24$	$73.24 \pm 0.29$
InsPo $\geq 0.1$ (6th round)	$72.41 \pm 0.33$	$79.76 \pm 0.42$	$75.91 \pm 0.31$
InsPo $\geq 0.2$ (5th round)	$75.75 \pm 0.44$	$76.50 \pm 0.24$	$76.12 \pm 0.28$
InsPo $\geq 0.3$ (4th round)	$79.04 \pm 0.29$	$77.73 \pm 0.24$	$78.38 \pm 0.20$
InsPo $\geq 0.4$ (3rd round)	$76.53 \pm 0.25$	$75.52 \pm 0.26$	$76.02 \pm 0.19$
InsPo $\geq 0.5$ (2nd round)	$78.34 \pm 0.28$	$76.06 \pm 0.29$	$77.18 \pm 0.21$
InsPo $\geq 0.6$ (1st round)	$74.27 \pm 0.37$	$77.76 \pm 0.32$	$75.97 \pm 0.31$

Table 4. Noise ratio and set scale of the different training sets.

	Orig. Set	$\geq 0.1$	$\geq 0.2$	$\geq 0.3$	$\geq 0.4$	$\geq 0.5$	$\geq 0.6$
Noise Ratio	0.66	0.37	0.33	0.23	0.37	0.21	0.18
Set Scale	248,872	49,224	36,156	30,859	26,923	24,066	19,265

Table 5. Instance-popularity cases of some relation instances, and their establishment time. Due to limited space, we only display the month in which instance-popularity peaks, and merge every three consecutive days into one cell.

Equations

$$C(r(e_{i},e_{j})) = \frac{|\{rule \mid \mathrm{matches}\ r(e_{i},e_{j})\}|}{Z_{rule}} + \frac{|\{m \mid \mathrm{mention\_of}\ r(e_{i},e_{j})\}|}{Z_{m}} \qquad (1)$$

$$\mathrm{InsPo}_{r(e_{i},e_{j})}^{t} = \frac{|A_{r(e_{i},e_{j})}^{t}|}{|\Omega_{r(e_{i},e_{j})}|} \qquad (2)$$

$$|\Omega_{r(e_{i},e_{j})}| = \frac{1}{L} \sum_{n=1}^{N} |A_{r(e_{i},e_{j})}^{t_{n}}| \qquad (3)$$

$$|{A^{\prime}}_{r(e_{i},e_{j})}^{t}| = \sum_{k=1}^{K} |A_{r(e_{i},e_{j})}^{t}| \cdot p_{\mathcal{P}_{k}}^{t} \qquad (4)$$

$$|{\Omega^{\prime}}_{r(e_{i},e_{j})}| = \frac{1}{L} \sum_{n=1}^{N} |{A^{\prime}}_{r(e_{i},e_{j})}^{t_{n}}| \qquad (5)$$

$$|A_{r(e_{i},e_{j})}^{t}| \approx \frac{1}{\sum_{k=1}^{K} p_{\mathcal{P}_{k}}} \cdot |{A^{\prime}}_{r(e_{i},e_{j})}^{t}| \qquad (6)$$

$$|\Omega_{r(e_{i},e_{j})}| \approx \frac{1}{\sum_{k=1}^{K} p_{\mathcal{P}_{k}}} \cdot |{\Omega^{\prime}}_{r(e_{i},e_{j})}| \qquad (7)$$

$$\mathrm{InsPo}_{r(e_{i},e_{j})}^{t} \approx \frac{|{A^{\prime}}_{r(e_{i},e_{j})}^{t}|}{|{\Omega^{\prime}}_{r(e_{i},e_{j})}|} \qquad (8)$$

Taxonomy

Topics: Topic Modeling · Web Data Mining and Analysis · Advanced Text Analysis Techniques

Full text

Towards Time-Aware Distant Supervision for Relation Extraction

Tianwen Jiang (Harbin Institute of Technology), Sendong Zhao (Harbin Institute of Technology), Jing Liu (Microsoft Research Asia), Jin-Ge Yao (Microsoft Research Asia), Ming Liu (Harbin Institute of Technology), Bing Qin (Harbin Institute of Technology), Ting Liu (Harbin Institute of Technology) and Chin-Yew Lin (Microsoft Research Asia)

Abstract.

Distant supervision for relation extraction suffers heavily from the wrong labeling problem. To alleviate this issue in timestamped news data, we take a new factor, time, into consideration and propose a novel time-aware distant supervision framework (Time-DS). Time-DS is composed of a time series, instance-popularity, and two strategies. Instance-popularity encodes the strong relevance between time and true relation mentions, and is therefore an effective clue for reducing the noise generated by distant supervision labeling. The two strategies, hard filter and curriculum learning, are both ways to apply instance-popularity for better relation extraction within Time-DS. Curriculum learning is a more sophisticated and flexible way to exploit instance-popularity to eliminate the bad effects of noise and thus obtain better relation extraction performance. Experiments on our collected multi-source news corpus show that Time-DS achieves significant improvements for relation extraction.

Keywords: relation extraction, distant supervision, time-aware


1. Introduction

Distant supervision (DS) has become a popular paradigm for relation extraction in recent years (Mintz et al., 2009; Zeng et al., 2015; Zheng et al., 2017). It can greatly expand the set of annotated training instances by aligning relation instances in knowledge bases (KB) to sentences in text.

However, distant supervision suffers heavily from the wrong labeling problem, because the aligned sentences do not necessarily express the same relations as the ones in the KB (Mintz et al., 2009; Riedel et al., 2010). This wrong labeling problem introduces many false positive training instances that hurt the performance of the models. Many efforts have been made to alleviate the bad effects of the noise produced by DS. Some studies (Riedel et al., 2010; Surdeanu et al., 2012; Ritter et al., 2013; Min et al., 2013) applied multi-instance learning to relax the distant supervision assumption into the at-least-one assumption: if two entities hold a relation in a KB, at least one sentence that mentions the entity pair expresses the relation. More recently, some neural network studies (Zeng et al., 2015; Lin et al., 2016; Feng et al., 2017) learned from multiple instances with attention, without explicitly characterizing the inherent noise. However, all these studies attempted to make models more noise-tolerant rather than to reduce the noise at the source, i.e., in the process of distant supervision. Therefore, these studies still suffer from the effects of noise to some degree.

Taking time into consideration can effectively alleviate the noise in DS for relation extraction. We find that relation instances in news data are usually time-sensitive and thus not uniformly distributed in news, e.g., Partnership(Microsoft, Facebook) in Figure 1. The mentions of Partnership(Microsoft, Facebook) tend to be concentrated in a certain period of time, i.e., May 26 to 27, 2016. Therefore, an intuition is that the automatically annotated data produced within such a period contains far less noise, while mentions on other days are more likely to be false positives.

To model the above intuition, i.e., introducing the new factor time to enhance DS for automatic dataset annotation, we propose a novel time-aware distant supervision framework (Time-DS). Time-DS can effectively reduce the impact of noise by making use of time. Time-DS uses a time series, instance-popularity, for each relation instance to indicate how many news items mention the relation each day. Instance-popularity is proposed to encode the strong relevance between time and true relation mentions. To make better use of the instance-popularity time series, Time-DS considers two strategies.

The first strategy takes instance-popularity as a hard filter to eliminate noise in the process of DS, i.e., during aligning. This hard filter sets a hard threshold to filter out noisy data, i.e., unreliably aligned sentences. This is a simple strategy with apparent drawbacks: it (1) relies heavily on the threshold of instance-popularity, and (2) is unable to utilize the noise in training to make DS-based models more robust. We therefore propose a second strategy, which conducts curriculum learning (Bengio et al., 2009) on the weighted training instances, i.e., instances with instance-popularity. This is a slightly more sophisticated but more flexible way to exploit instance-popularity. The main idea of curriculum learning is simple: start with the easiest aspect of a task and raise the difficulty gradually. In this study, we begin with the high-quality annotated sentences and gradually add lower-quality sentences to the training set according to our proposed instance-popularity. Curriculum learning for Time-DS can make full use of every weighted training instance for better relation extraction performance, while yielding a robust model that tolerates noise.

We conduct relation extraction experiments on a timestamped multi-source news corpus, in which it is more natural to exploit rich temporal statistics than in independent single documents. The multi-source news corpus is also valuable for training a DS model for relation extraction in two other respects. First, it contains diverse expressions of the same relations, which makes it superior to a single-source news corpus. Second, a large number of relation mentions can be obtained from a few relation instance seeds. Both aspects benefit the training of a powerful and robust DS model. The experimental results show the superiority of our proposed Time-DS for relation extraction. Our contributions are as follows:

• To alleviate the noise issue of distant supervision in time-sensitive domains such as timestamped news data, we take time into consideration.

• To use time in a sophisticated and flexible way, we apply curriculum learning based on a time series, instance-popularity, which proves effective for noise elimination.

• A timestamped multi-source news corpus is collected. In such a multi-source corpus it is more natural to exploit rich temporal statistics than in independent single documents.

2. Problem Statement

In this section, we first introduce some concepts used in this paper, and then formally define Time-DS for relation extraction.

Definition 1. (Relation Instance) If a relation $r$ holds between two entities $(e_{i},e_{j})$ , we take $r(e_{i},e_{j})$ as a relation instance, such as Partnership(Microsoft, Facebook).

Definition 2. (Relation Mention) For the relation instance $r(e_{i},e_{j})$ , we define relation mention as a triple $(e_{i},e_{j},s)$ , consisting of an entity pair $(e_{i},e_{j})$ and a sentence $s$ . The sentence $s$ contains these two entities $e_{i}$ and $e_{j}$ , and expresses the relation $r$ of the two entities.

Definition 3. (Supervision Knowledge) In the context of distant supervision, supervision knowledge is a supervision signal that enables automatic dataset annotation and thus saves manual labor. Relation instances in knowledge bases are usually taken as such supervision knowledge. However, knowledge bases may not be available for DS in some domains. As an alternative, we propose in this paper to extract supervision knowledge from news data via high-quality rules.

Problem. (Time-DS for Relation Extraction) Given an unlabeled corpus with timestamps, such as news data, Time-DS is supposed to make the most of time to produce a high-quality annotated dataset automatically, or to eliminate the bad effects of the noise produced by the basic DS when training relation extraction models.

Time-DS must address the following three questions:

Question 1. How to obtain supervision knowledge when knowledge base is unavailable in the target domain?

Question 2. How to eliminate the bad effects of noises in obtained relation mentions through alignment of DS?

Question 3. Can we make use of these noises in a reasonable way instead of discarding them simplistically?

3. Time-DS

We introduce the details of the time-aware distant supervision (Time-DS) framework in this section, following the four steps of Figure 2. First, a rule-based method is used to extract supervision knowledge for the case when a KB is unavailable (Section 3.1, answering Question 1). Second, the instance-popularity distribution of each relation instance is approximated based on its rule-matched mentions in news data (Sections 3.2 and 3.3). Third, the supervision knowledge (i.e., relation instances) is aligned to sentences in the raw corpus, with instance-popularity attached, to generate an automatically annotated dataset for relation extraction model training (Section 3.4). Finally, Time-DS considers two strategies for better use of the instance-popularity time series: hard filter, which answers Question 2, and curriculum learning, which answers Question 3 (Section 3.4).

3.1. Extracting Supervision Knowledge

A KB is the typical source of supervision knowledge in previous DS-based studies. However, a KB is often unavailable in specific domains, which raises Question 1. Even when a KB is available, it is still impossible to obtain the instance-popularity distribution using only the relation instances in the KB; additional information such as relation mentions and timestamps is also needed. Therefore, it is valuable to first extract a few relation instances from the raw corpus as supervision knowledge, along with their timestamped relation mentions.

First, we apply a few pre-designed high-quality rules to extract relation instances as candidate supervision knowledge. At the same time, we keep the matched relation mentions with their timestamps. Each rule follows the template $<$ Pattern, Constraint $>$ , where Pattern is a regular expression containing a selected connector, and Constraint is a lexical constraint on the entities to which the pattern can be applied. For example, given the connector “has formed a partnership with”, we use the pattern “[entity1] has formed a partnership with [entity2]” to extract the Partnership relation between organizations, with the constraint that both [entity1] and [entity2] must be organizations. As a consequence, this pattern matches the sentence “Microsoft has formed a partnership with Facebook”, but does not match the sentence “Kevin has formed a partnership with Jack to finish the project”.
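As an illustration of the $<$ Pattern, Constraint $>$ template, the following minimal Python sketch applies one connector-based rule under an entity-type constraint. The `Rule` class, the `apply_rule` function and the `entity_types` mapping (entity string to NER type) are hypothetical names introduced only for this example; the actual rule engine used in the paper is not specified.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical <Pattern, Constraint> rule: a connector plus required entity types."""
    relation: str
    connector: str
    allowed_types: tuple  # (type of entity1, type of entity2)


def apply_rule(rule, sentence, entity_types):
    """Return (relation, entity1, entity2) if the rule matches the sentence, else None.

    entity_types maps each recognized entity string to its NER type
    (an assumed input format, e.g. produced by an NER tool).
    """
    for e1, t1 in entity_types.items():
        for e2, t2 in entity_types.items():
            # Constraint: both entities must have the types the rule requires.
            if e1 == e2 or (t1, t2) != rule.allowed_types:
                continue
            # Pattern: "[entity1] <connector> [entity2]".
            pattern = (re.escape(e1) + r"\s+" + re.escape(rule.connector)
                       + r"\s+" + re.escape(e2))
            if re.search(pattern, sentence):
                return (rule.relation, e1, e2)
    return None


partnership = Rule("Partnership", "has formed a partnership with",
                   ("ORGANIZATION", "ORGANIZATION"))
hit = apply_rule(partnership,
                 "Microsoft has formed a partnership with Facebook",
                 {"Microsoft": "ORGANIZATION", "Facebook": "ORGANIZATION"})
miss = apply_rule(partnership,
                  "Kevin has formed a partnership with Jack to finish the project",
                  {"Kevin": "PERSON", "Jack": "PERSON"})
```

Here `hit` is the Partnership instance for Microsoft and Facebook, while `miss` is `None` because the type constraint rejects person entities.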

Second, we calculate the confidence of each extracted relation instance and set a reasonable threshold as a filter to obtain the final supervision knowledge (i.e., a set of high-confidence relation instances), as shown in the first step of Figure 2. We believe that the confidence of a relation instance $r(e_{i},e_{j})$ , denoted as $C(r(e_{i},e_{j}))$ , depends on the number of its relation mentions and the number of distinct rules matched in these mentions. We assume that more mentions and more matched rules mean a more reliable relation instance. Following this assumption and treating mentions and matched rules equally, we define $C(r(e_{i},e_{j}))$ as follows,

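$$C(r(e_{i},e_{j})) = \frac{|\{rule \mid \mathrm{matches}\ r(e_{i},e_{j})\}|}{Z_{rule}} + \frac{|\{m \mid \mathrm{mention\_of}\ r(e_{i},e_{j})\}|}{Z_{m}} \qquad (1)$$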

where $\{rule|\,\mathrm{matches}\,r(e_{i},e_{j})\}$ represents the set of matched rules for $r(e_{i},e_{j})$ , and $\{m|\,\mathrm{mention\_of}\,r(e_{i},e_{j})\}$ represents the set of matched sentences for $r(e_{i},e_{j})$ . $Z_{rule}$ and $Z_{m}$ are normalizers: the maximum number of matched rules and the maximum number of matched sentences, respectively, over all relation instances.
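The confidence score can be sketched in a few lines of Python. This is only an illustration under assumptions: the input format (a dictionary from relation instance to its set of matched rules and list of matched sentences) and the function name `confidence_scores` are hypothetical, and the normalizers are taken as maxima over the candidates passed in.

```python
def confidence_scores(candidates):
    """Compute C(r(e_i, e_j)) for every candidate relation instance.

    candidates: dict mapping a relation instance to a pair
    (set of matched rule ids, list of matched sentences); assumed format.
    """
    # Normalizers: maxima over all candidate relation instances.
    z_rule = max(len(rules) for rules, _ in candidates.values())
    z_m = max(len(mentions) for _, mentions in candidates.values())
    return {
        inst: len(rules) / z_rule + len(mentions) / z_m
        for inst, (rules, mentions) in candidates.items()
    }


toy = {
    "Partnership(A, B)": ({"rule1", "rule2"}, ["s1", "s2", "s3", "s4"]),
    "Partnership(C, D)": ({"rule1"}, ["s5"]),
}
scores = confidence_scores(toy)
```

With both terms normalized to at most 1, the score lies in (0, 2]; a threshold on it then selects the final supervision knowledge.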

3.2. Definition of Instance-Popularity

The assumption of distant supervision is that any sentence containing a pair of entities that participate in a relation instance is likely to express that relation in some way. However, this assumption does not always hold: a large proportion of the sentences containing a given entity pair are noise. Figure 1 indicates that the mentions within a certain period of time contain far less noise, while mentions on other days are more likely to be false positives. In other words, whether an aligned sentence expresses the corresponding relation is strongly related to time.

Following this intuition, we introduce a time series, instance-popularity, for each relation instance to indicate how many news items express the relation in each period of time. Given a sentence at some time point that contains the two entities of a relation instance, we assume that the certainty of it expressing the relation is proportional to the instance-popularity of the relation instance at that time point. Instance-popularity lays the groundwork for addressing Question 2.

Formally, the instance-popularity (denoted as InsPo) of a given relation instance $r(e_{i},e_{j})$ at time $t$ is defined as:

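$$\mathrm{InsPo}_{r(e_{i},e_{j})}^{t} = \frac{|A_{r(e_{i},e_{j})}^{t}|}{|\Omega_{r(e_{i},e_{j})}|} \qquad (2)$$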

where $\textrm{InsPo}_{r(e_{i},e_{j})}^{t}$ and $A_{r(e_{i},e_{j})}^{t}$ represent, respectively, the instance-popularity of $r(e_{i},e_{j})$ at $t$ and the set of sentences expressing $r(e_{i},e_{j})$ in the $t$ -centric time-window. $|A_{r(e_{i},e_{j})}^{t}|$ denotes the size of $A_{r(e_{i},e_{j})}^{t}$ . $\Omega_{r(e_{i},e_{j})}$ is the whole set of sentences expressing $r(e_{i},e_{j})$ over time, used for normalization, and

$$|\Omega_{r(e_{i},e_{j})}| = \frac{1}{L} \sum_{n=1}^{N} |A_{r(e_{i},e_{j})}^{t_{n}}| \qquad (3)$$

where $|\Omega_{r(e_{i},e_{j})}|$ is the size of $\Omega_{r(e_{i},e_{j})}$ , $L$ is the length of the time-window, and $N$ is the number of time points under consideration.
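To make the definition concrete, here is a minimal Python sketch that computes an instance-popularity series from per-day sentence counts. It assumes an odd window length centred on each day and truncates windows at the ends of the series; the function names and the daily-count input are illustrative choices, not the authors' implementation.

```python
def windowed_counts(daily_counts, window):
    """|A^t| for each day t: sentences falling in the t-centric window."""
    half = window // 2
    n = len(daily_counts)
    # Windows are truncated at the two ends of the series (an assumption).
    return [sum(daily_counts[max(0, i - half):min(n, i + half + 1)])
            for i in range(n)]


def instance_popularity(daily_counts, window=3):
    """InsPo^t = |A^t| / |Omega|, with |Omega| = (1/L) * sum_n |A^{t_n}|."""
    a = windowed_counts(daily_counts, window)
    omega = sum(a) / window
    if omega == 0:
        return [0.0] * len(a)
    return [x / omega for x in a]


# Toy series: all 8 mentions of a relation instance fall on two adjacent days.
inspo = instance_popularity([0, 0, 4, 4, 0, 0], window=3)
```

In this toy series the windowed counts are 0, 4, 8, 8, 4, 0 and the normalizer is 24 / 3 = 8, so the popularity peaks at 1.0 on the two days whose windows cover every mention.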

3.3. Approximation of Instance-Popularity

The actual whole set of sentences which express the relation instance is usually unavailable in practice. Thus we cannot compute the instance-popularity directly according to Equation 2. In this section, we provide an approximate method to calculate the instance-popularity.

For each relation instance $r(e_{i},e_{j})$ in the supervision knowledge (obtained in Section 3.1), the set of all its rule-matched sentences is denoted as ${\Omega^{\prime}}_{r(e_{i},e_{j})}$ , which is a subset of $\Omega_{r(e_{i},e_{j})}$ . Further, we can obtain the set of sentences in any $t$ -centric time-window from ${\Omega^{\prime}}_{r(e_{i},e_{j})}$ , denoted as ${A^{\prime}}_{r(e_{i},e_{j})}^{t}$ . We assume that people select relation patterns under some distribution in news data, and we use $p_{\mathcal{P}}^{t}$ to denote the probability of the relation pattern $\mathcal{P}$ being selected to express the given relation instance at time point $t$ . Then we can calculate $|{A^{\prime}}_{r(e_{i},e_{j})}^{t}|$ and $|{\Omega^{\prime}}_{r(e_{i},e_{j})}|$ under the probability distribution of the relation patterns in the pre-designed rules,

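$$|{A^{\prime}}_{r(e_{i},e_{j})}^{t}| = \sum_{k=1}^{K} |A_{r(e_{i},e_{j})}^{t}| \cdot p_{\mathcal{P}_{k}}^{t} \qquad (4)$$

$$|{\Omega^{\prime}}_{r(e_{i},e_{j})}| = \frac{1}{L} \sum_{n=1}^{N} |{A^{\prime}}_{r(e_{i},e_{j})}^{t_{n}}| \qquad (5)$$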

where $\mathcal{P}_{k}$ is the $k$ -th relation pattern, and $K$ and $N$ are the numbers of relation patterns and time points, respectively.

Assumption. In the multi-source news corpus, a given pattern $\mathcal{P}_{k}$ for expressing a relation instance $r(e_{i},e_{j})$ is selected with the same probability $p_{\mathcal{P}_{k}}$ at any time, i.e., the probability of a given pattern being selected is independent of time. Therefore we have $p_{\mathcal{P}_{k}}^{t_{1}}\approx p_{\mathcal{P}_{k}}^{t_{2}}\approx\cdots\approx p_{\mathcal{P}_{k}}$ .

From Equations 2 to 5 and the Assumption, we obtain the following equations.

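$$|A_{r(e_{i},e_{j})}^{t}| \approx \frac{1}{\sum_{k=1}^{K} p_{\mathcal{P}_{k}}} \cdot |{A^{\prime}}_{r(e_{i},e_{j})}^{t}| \qquad (6)$$

$$|\Omega_{r(e_{i},e_{j})}| \approx \frac{1}{\sum_{k=1}^{K} p_{\mathcal{P}_{k}}} \cdot |{\Omega^{\prime}}_{r(e_{i},e_{j})}| \qquad (7)$$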

Based on Equations 6 and 7, we can approximate the instance-popularity of $r(e_{i},e_{j})$ at time point $t$ , which is the second step in Figure 2. The approximation is as follows.

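$$\mathrm{InsPo}_{r(e_{i},e_{j})}^{t} \approx \frac{|{A^{\prime}}_{r(e_{i},e_{j})}^{t}|}{|{\Omega^{\prime}}_{r(e_{i},e_{j})}|} \qquad (8)$$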
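The step from the definition of instance-popularity to its approximation can be spelled out as follows. This is a short worked derivation using only the definition (Equation 2) and the two approximations for $|A^{t}|$ and $|\Omega|$ (Equations 6 and 7); the subscript $r(e_{i},e_{j})$ is dropped for readability. Under the time-independence Assumption, the unknown factor $1/\sum_{k} p_{\mathcal{P}_{k}}$ appears in both numerator and denominator and cancels, so the pattern probabilities never need to be estimated.

```latex
\mathrm{InsPo}^{t}
  = \frac{|A^{t}|}{|\Omega|}
  \approx \frac{\dfrac{1}{\sum_{k=1}^{K} p_{\mathcal{P}_{k}}} \cdot |{A^{\prime}}^{t}|}
               {\dfrac{1}{\sum_{k=1}^{K} p_{\mathcal{P}_{k}}} \cdot |{\Omega^{\prime}}|}
  = \frac{|{A^{\prime}}^{t}|}{|{\Omega^{\prime}}|}
```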

3.4. Two Strategies to Exploit Instance-Popularity

Given the unlabeled timestamped corpus, we align the supervision knowledge, i.e., the relation instances, to the corpus, as shown in the third step of Figure 2. This yields a large number of relation mentions together with their approximate instance-popularity; in other words, a large-scale annotated dataset with instance-popularity attached. A relation extraction model can then be trained on this dataset. To make better use of the instance-popularity time series during training, Time-DS considers two strategies, hard filter and curriculum learning, as the last step of Figure 2.

Hard Filter. We have obtained a dataset with sentence-level instance-popularity attached, where instance-popularity quantifies the reliability of each annotated sentence. The hard filter sets a hard threshold on instance-popularity and filters the dataset to obtain a higher-quality subset, discarding the noise. This is the simplest way to utilize Time-DS.
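The hard filter amounts to a single thresholding pass. The sketch below assumes each annotated sentence is a dictionary carrying its approximate instance-popularity under the hypothetical key `"inspo"`; the record layout is an assumption made for illustration.

```python
def hard_filter(dataset, threshold):
    """Keep only annotated sentences whose instance-popularity is >= threshold."""
    return [example for example in dataset if example["inspo"] >= threshold]


toy_dataset = [
    {"sentence": "s1", "relation": "Partnership", "inspo": 0.70},
    {"sentence": "s2", "relation": "Partnership", "inspo": 0.35},
    {"sentence": "s3", "relation": "Partnership", "inspo": 0.05},
]
# One filtered training set per threshold; 0.0 keeps the whole set.
filtered_sets = {th: hard_filter(toy_dataset, th) for th in (0.0, 0.3, 0.6)}
```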

However, despite its effectiveness, the hard filter (1) relies heavily on the threshold of instance-popularity, and (2) refuses to use any of the noise to make DS-based models robust. We hope that the noise can be used in a reasonable way to train the model more thoroughly rather than being discarded directly, as asked in Question 3. A natural idea is to guide the relation extraction model to adapt to the noisy training sets gradually, i.e., to learn something simple first and then attempt to deal with noise. Fortunately, a technique called curriculum learning fits our problem.

Curriculum Learning. The main idea of curriculum learning (Bengio et al., 2009) is simple: start with the easiest aspect of a task and raise the difficulty gradually. In our study, we begin with the high-quality annotated sentences and gradually add lower-quality sentences to the training set according to instance-popularity. In particular, all the annotated sentences from distant supervision are ranked by instance-popularity from high to low. We then divide the ranked list into several groups by assigning different instance-popularity thresholds, i.e., $\{\mathrm{Rank}_{1,\dots,n}, \mathrm{Rank}_{n+1,\dots,n+k}, \mathrm{Rank}_{n+k+1,\dots,n+m}, \dots\}$ . Different training sets can then be created by cumulatively merging the groups in ranking order, i.e., $\mathrm{Rank}_{1,\dots,n}$ , $\mathrm{Rank}_{1,\dots,n+k}$ , $\mathrm{Rank}_{1,\dots,n+m}$ , etc.

Then, following the strategy of curriculum learning, (1) the model is first trained on the highest-quality training set, i.e., $\mathrm{Rank}_{1,\dots,n}$ . (2) After this training is complete, the second highest-quality group is merged with the previous training set to generate a new training set, i.e., $\mathrm{Rank}_{1,\dots,n+k}$ , and the model is trained again on this new training set. (3) The above process is repeated, i.e., lower-quality groups are added gradually and the model is trained on every new training set, until all annotated sentences from DS have been taken into consideration. Note that the training instances, i.e., sentences, are shuffled during each training round.
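The curriculum schedule can be sketched as a loop over decreasing thresholds, where each round trains on the cumulative set of sentences at or above the current threshold. The names `curriculum_rounds` and `train_with_curriculum`, the `"inspo"` key, and the `train_fn(model, examples)` callback are all hypothetical placeholders; any real training routine would be substituted for `train_fn`.

```python
import random


def curriculum_rounds(dataset, thresholds):
    """Yield (threshold, cumulative training set), strictest threshold first."""
    for th in sorted(thresholds, reverse=True):
        yield th, [ex for ex in dataset if ex["inspo"] >= th]


def train_with_curriculum(model, dataset, thresholds, train_fn, seed=0):
    """Continue training the same model on progressively noisier training sets."""
    rng = random.Random(seed)
    for _, subset in curriculum_rounds(dataset, thresholds):
        rng.shuffle(subset)               # instances are shuffled in every round
        model = train_fn(model, subset)   # placeholder for a real training step
    return model


toy_dataset = [{"inspo": 0.70}, {"inspo": 0.35}, {"inspo": 0.05}]
# Stand-in "model": a log of how many sentences each round trained on.
round_sizes = train_with_curriculum(
    [], toy_dataset, thresholds=[0.0, 0.3, 0.6],
    train_fn=lambda model, examples: model + [len(examples)],
)
```

The design choice illustrated here is that the model is carried over between rounds rather than re-initialized, so the earlier, cleaner rounds shape what the model learns before the noisier sentences arrive.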

4. Experiments

In this part, we conduct relation extraction and classification experiments on five time-sensitive relations from a multi-source news corpus.

4.1. Data Preparation

We collected about 42 million news articles from 50,428 different online news websites, spanning 8 months from Jan. 2016 to Aug. 2016. For each article, the title, first paragraph, and timestamp are retained to construct a multi-source news corpus. The corpus contains nearly 320 million sentences in total (the multi-source news corpus will be made openly available). The Stanford CoreNLP tool (Manning et al., 2014) is applied to recognize named entities in the multi-source news corpus. Organization management is an interesting and informative domain in news data, so we focus on five typical time-sensitive relations in this domain, namely Acquisition, Investing, JobChange, Lawsuit, and Partnership. It is worth mentioning that Time-DS can easily be transferred to other time-sensitive relations, such as MarriedTo or VisitIn, just by designing a few high-quality rules for these relations.

Acquisition. An organization buys another organization (directed). Example: Verizon announced it had completed the $ 4.4 billion acquisition of AOL.

Investing. An organization puts money into another organization (directed). Example: Vontobel Asset Management Inc. boosted its position in shares of Mastercard Inc..

JobChange. A person leaves or joins an organization (directed). Example: Papiss Cisse has left Newcastle United.

Lawsuit. An organization sues another organization (directed). Example: Samsung Elec sues Huawei for patent infringement.

Partnership. An organization forms a partnership with another organization (undirected). Example: Konami has announced a partnership with FC Barcelona for PES 2017.

Test Set. We follow a previous study (Mintz et al., 2009) and hold out part of the relation instances in the supervision knowledge, which are aligned to the corpus to obtain the test set. However, such a test set also suffers from the wrong labeling problem, which leads to only a rough measure of performance. Hence we refine the test set in three steps. (1) First, we filter the test instances with a suitable instance-popularity threshold (in our experiments, 0.2 for the relation Investing and 0.7 for the other four relations) to get the candidate positive samples, and also keep some of the filtered-out instances as candidate negative samples. (2) Then three experts in the relevant domain independently proofread the candidate test set, i.e., they judge the existing tags and correct them where necessary. (3) Finally, the remaining disagreements are resolved, and samples for which no consensus can be reached are removed. In the end, 2,463 positive and 376 negative samples form the final test set (see Table 1), of which 186 samples were corrected.

Validation Set. The above test set is randomly partitioned into 10 equal-size subsets. Of the 10 subsets, a single subset is retained as the validation set for model selection, and the remaining 9 subsets are used as testing data. This process is then repeated 10 times, with each of the 10 subsets used exactly once as the validation set. The trained model with the best averaged performance on the validation sets is selected as the final model for evaluation.
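The rotation of validation and test subsets described above can be sketched as follows. This is a minimal illustration: the function name, the fixed seed, and the way nearly equal folds are formed when the sample count is not divisible by 10 are assumptions, since the original does not give these details.

```python
import random


def validation_test_splits(n_samples, n_folds=10, seed=0):
    """Yield (validation indices, test indices), one pair per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding gives folds whose sizes differ by at most one.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        validation = folds[i]
        test = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield validation, test


splits = list(validation_test_splits(20))
```

Each sample index appears in exactly one validation subset across the 10 rounds, and in the test portion of the other 9.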

4.2. Target Models

In this part, we describe the two models that are plugged into our Time-DS framework and the basic DS framework for the end-to-end relation extraction task and the relation classification task. The relation extraction task is to extract relation mentions from the given sentences and categorize them into a pre-defined set of relation types. If the relation mentions are given, the task becomes a classification problem, called relation classification. In this paper, both relation extraction and relation classification are studied at the sentence level.

End-to-End Relation Extraction. We plug the model proposed by Zheng et al. (2017), i.e., LSTM-LSTM-Bias, into our Time-DS framework. LSTM-LSTM-Bias designs a novel tagging schema to convert the task into a sequence tagging problem. The model can therefore extract entities and relations jointly without other redundant information, and it achieves the best results on the public dataset. Since LSTM-LSTM-Bias is a sequence tagging model, its training only needs word-level positive and negative labels. Thus we can train the model directly on the automatically annotated datasets generated by Time-DS.

Relation Classification. We apply the Bi-LSTM and attention based neural model proposed by Zhou et al. (2016), called Att-BLSTM, which is a typical paradigm for relation classification. The training of Att-BLSTM needs negative samples, which are unavailable in the datasets generated by Time-DS. To obtain negative samples, we replace the tail entity of each relation instance with another entity in the same sentence. For instance, given a sentence “A has formed a partnership with B, which is located in C” and its relation instance Partnership(A, B), we replace the entity B with the entity C. The new relation instance Partnership(A, C) and the original sentence form a negative sample.

Metrics. Similar to previous work, we report the aggregate precision/recall curves for the end-to-end relation extraction model, and macro-F1 for the relation classification model.
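The tail-replacement procedure for negative samples can be illustrated with a small Python sketch. The function name, the dictionary layout, and the choice to emit one negative sample per alternative entity are assumptions made for this example; the original only states that the tail entity is replaced with another entity from the same sentence.

```python
def make_negative_samples(sentence, relation, head, tail, entities):
    """Build negative samples by swapping the tail for another entity in the sentence."""
    negatives = []
    for other in entities:
        if other in (head, tail):
            continue  # skip the original entity pair
        negatives.append({
            "sentence": sentence,
            "relation": relation,
            "head": head,
            "tail": other,
            "label": "negative",
        })
    return negatives


negatives = make_negative_samples(
    "A has formed a partnership with B, which is located in C",
    relation="Partnership", head="A", tail="B", entities=["A", "B", "C"],
)
```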
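As a reading aid for the classification results, the reported macro-F1 values in Tables 2 and 3 appear numerically consistent with the harmonic mean of the reported macro-P and macro-R. This is an observation from the published numbers, not a definition stated in the paper, and the helper name below is hypothetical.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall (both in percent or both in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Original Set row and InsPo >= 0.3 row of Table 2 (macro-P, macro-R in percent).
f1_original = f1_from_precision_recall(67.62, 78.62)
f1_filtered = f1_from_precision_recall(76.38, 77.65)
```

Rounded to two decimals, these give 72.71 and 77.01, matching the macro-F1 column for those two rows.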

4.3. Annotated Datasets Generation with Instance-Popularity

For Question 1, i.e., how to obtain supervision knowledge when knowledge base is unavailable, we apply only a few of manually designed high-quality rules (see in Section 3.1)333Only 5 to 8 different rules for each relation is suitable in our experiments.. Meanwhile the rule-matched sentences are reserved for instance-popularity approximation. Then we use confidence (see Equation 1) to filter the rule-based extracted relation instances to form the final supervision knowledge. The final supervision knowledge is divided into two sub-sets, i.e., for training and test, according to held-out method. The statistics is reported in the “#Relation Instances” column of Table 1. We can see that the amount of gained relation instances is much smaller (only about 2.1 hundred) compared with the existent KB(over 3.2 million relation instances are obtained in FreeBase in (Riedel et al., 2010)). However, with the help of the multi-source news corpus, we can gain huge amount of expressive annotated sentences.

Aligning the relation instances to the corpus, we get 248,872 training instances, i.e., relation mentions, annotated with instance-popularity, called the Original Set. The instance-popularity distribution of each relation instance is directly approximated according to Equation 8, based on the reserved rule-matched sentences (footnote 4: the size of the time window is set to 3 days in our experiments). To use the hard filter strategy, we filter the Original Set to get another six subsets according to different instance-popularity thresholds. The highest threshold is set to 0.6, because the number of Investing mentions is nearly zero when the threshold is higher than 0.6. The overall statistics of the annotated datasets are shown in Table 1.
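The hard filter itself amounts to thresholding each mention on its instance-popularity. The following is a minimal sketch under stated assumptions: the `Mention` tuple layout, the helper names, and the toy scores are hypothetical, and the threshold grid from 0.1 to 0.6 in steps of 0.1 is inferred from the subset names in Table 1 and the ranges listed in Section 4.5.

```python
from typing import Dict, List, Tuple

# (sentence, relation, instance-popularity); hypothetical layout for illustration.
Mention = Tuple[str, str, float]

# Assumed threshold grid, from ">= 0.1" up to ">= 0.6".
THRESHOLDS = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]


def hard_filter(original_set: List[Mention], threshold: float) -> List[Mention]:
    """Keep only mentions whose instance-popularity reaches the threshold."""
    return [m for m in original_set if m[2] >= threshold]


def build_filtered_sets(original_set: List[Mention]) -> Dict[float, List[Mention]]:
    """Build one filtered training set per threshold."""
    return {t: hard_filter(original_set, t) for t in THRESHOLDS}


# Toy mentions with made-up scores.
toy_set = [
    ("s1", "Acquisition", 0.05),
    ("s2", "Acquisition", 0.35),
    ("s3", "Lawsuit", 0.6),
    ("s4", "Investing", 0.8),
]
filtered = build_filtered_sets(toy_set)
sizes = {t: len(subset) for t, subset in filtered.items()}
print(sizes)
```

Each filtered set is a subset of the one built with the next lower threshold, so the training sets shrink monotonically as the filter becomes stricter.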

4.4. Effectiveness of Hard Filter

To answer Question 2, i.e., how to eliminate the bad effects of the noise produced by the basic DS, we adopt an instance-popularity-based hard filter strategy. Here we examine the performance of the hard filter strategy on the relation extraction and classification tasks.

Figure 3 shows the aggregate precision/recall curves of the relation extraction model trained on different datasets. We find that: (1) Except for the 0.1 threshold, the models trained on the hard-filtered datasets outperform the one trained in the manner of the basic DS, i.e., on the Original Set. (2) The precision/recall performance clearly improves as the instance-popularity threshold increases, as long as the threshold is lower than 0.5.

We also investigate more fine-grained precision/recall curves of the five relations separately (see Figure 4; footnote 5: some precision/recall curves backtrack, such as the one for Investing, because LSTM-LSTM-Bias is trained and makes predictions at the tag level instead of the relation level (Zheng et al., 2017)). We find that: (1) A similar general tendency is observed, that is, the models trained on hard-filtered datasets outperform the one trained in the manner of the basic DS when the hard filter threshold is higher than some value. (2) The effects of the hard filter on different relation types are not exactly the same. The performance for Partnership increases as the hard filter becomes stricter, while the performance on JobChange already peaks when the threshold reaches 0.1.

Table 2 presents the macro-average F1 values of the relation classification model. We can see that (1) the Time-DS framework on different fractions of the training set outperforms the basic DS framework on the whole training set for relation classification; (2) the macro-F1 value tends to increase with the instance-popularity threshold; (3) even with a much smaller training set (when $InsPo\geq 0.6$, the corresponding training set is almost one-thirteenth of the Original Set, see Table 1), the Time-DS framework outperforms the basic DS framework, achieving significant improvement for relation classification. Overall, these three observations show clearly that the hard filter with instance-popularity is a very effective strategy to segment the Original Set and eliminate the bad effects of the noise generated by the basic DS.

4.5. Effectiveness of Curriculum Learning

The hard filter strategy relies heavily on the threshold settings to remove the effects of noisy samples. However, noisy samples still play an important role in improving the robustness of models. This issue naturally brings up Question 3, i.e., can we make use of this noise in a reasonable way to improve robustness rather than simply discarding it? The curriculum learning strategy is used in Time-DS to answer Question 3.

To implement Time-DS with curriculum learning, we reorganize the Original Set into seven subsets with different instance-popularity ranges, i.e., [0.6, 1.0], [0.5, 1.0], [0.4, 1.0], [0.3, 1.0], [0.2, 1.0], [0.1, 1.0], [0.0, 1.0], and apply a curriculum learning strategy. In the 1st round, we use the subset [0.6, 1.0] to train the model for relation classification. In the 2nd round, we use the subset [0.5, 1.0] to retrain the model obtained from the 1st round. In the 3rd round, we use the subset [0.4, 1.0] to retrain the model obtained from the 2nd round. The later rounds follow the same rule until the 7th round, when all instances of the Original Set participate in the training process.
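The round structure can be sketched as a loop over the lower bounds of the seven ranges, with each round continuing from the model of the previous round. This is only an illustrative sketch: `curriculum_train`, the `Mention` tuple layout, and the `train_one_round` callback are hypothetical names, and the stand-in trainer below merely records subset sizes instead of fitting a real classifier.

```python
from typing import Callable, List, Tuple, TypeVar

Mention = Tuple[str, str, float]  # (sentence, relation, instance-popularity)
Model = TypeVar("Model")

# Lower bounds of the seven nested ranges, from [0.6, 1.0] down to [0.0, 1.0].
LOWER_BOUNDS = [0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]


def curriculum_train(
    model: Model,
    original_set: List[Mention],
    train_one_round: Callable[[Model, List[Mention]], Model],
) -> List[Model]:
    """Run the seven rounds; each round starts from the previous round's model."""
    checkpoints = []
    for lower in LOWER_BOUNDS:
        subset = [m for m in original_set if m[2] >= lower]
        model = train_one_round(model, subset)
        checkpoints.append(model)
    return checkpoints


# Stand-in "training" step: the model is a list logging each round's subset size.
def record_size(model: List[int], subset: List[Mention]) -> List[int]:
    return model + [len(subset)]


toy_set = [
    ("s1", "Acquisition", 0.05),
    ("s2", "Acquisition", 0.35),
    ("s3", "Lawsuit", 0.6),
    ("s4", "Investing", 0.8),
]
checkpoints = curriculum_train([], toy_set, record_size)
print(checkpoints[-1])
```

Keeping one checkpoint per round makes it possible to compare rounds afterwards, which is how a best round such as round 4 can be identified.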

Figure 5 presents the performance on relation extraction (footnote 6: some precision/recall curves start from a non-zero point, because softmax may output 1 in some dimension for optimization in the Python language). We find that (1) curriculum learning based Time-DS significantly outperforms the basic DS in every training round; (2) each training round of curriculum learning outperforms the traditional training in the hard filter strategy. Table 3 presents the performance on relation classification. (1) Comparing the different rounds of Time-DS with curriculum learning against the basic DS, it is clear that Time-DS with curriculum learning outperforms the basic DS in every round of training and achieves the best performance at round 4. (2) From round 1 to round 4, as noisy samples are gradually added, the performance tends to increase. However, the performance decreases when too many noisy samples are added to the training set after round 4.

4.6. Deep Analysis of Instance-Popularity

In this section, we analyze in depth how instance-popularity encodes the strong relevance between the timestamp and the true relation mention in news data.

Time-sensitive Relations. Many relations in news are very sensitive to time, such as the relations in Table 5. Around the time when such a relation is established, its relation mentions contain far less noise. This feature greatly benefits the alignment process of distant supervision. Therefore, we want to check the consistency between the peaking time of instance-popularity and the establishment time of each relation instance. In particular, we randomly sample several relation instances and acquire their establishment times from Wikipedia or news reports. Table 5 shows that the time when instance-popularity reaches its peak is usually consistent with the establishment time. Therefore, it is reasonable to take instance-popularity as a measure for finding relation mentions with less noise as training data.
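The consistency check can be made concrete with the dates listed in Table 5. The sketch below counts how many establishment dates fall inside the reported peaking window; the `is_consistent` helper and the optional one-day tolerance are assumptions introduced for illustration and are not part of the original analysis.

```python
from datetime import date

# (relation instance, peak start, peak end, establishment date), copied from Table 5.
ROWS = [
    ("Acquisition (Pfizer, Anacor Pharmaceuticals)", date(2016, 5, 16), date(2016, 5, 18), date(2016, 5, 16)),
    ("Acquisition (NBC, DreamWorks Animation)", date(2016, 4, 25), date(2016, 4, 30), date(2016, 4, 28)),
    ("Investing (Private Trust, Honeywell International)", date(2016, 2, 25), date(2016, 2, 27), date(2016, 2, 25)),
    ("Investing (Jennison Associates, Boeing)", date(2016, 1, 13), date(2016, 1, 15), date(2016, 1, 13)),
    ("JobChange (Louis Van Gaal, Manchester United)", date(2016, 5, 22), date(2016, 5, 24), date(2016, 5, 23)),
    ("JobChange (Derek Fisher, Knicks)", date(2016, 2, 7), date(2016, 2, 9), date(2016, 2, 7)),
    ("Lawsuit (Huawei, Samsung Electronics)", date(2016, 5, 22), date(2016, 5, 27), date(2016, 5, 25)),
    ("Lawsuit (Wal-Mart Stores Inc., Visa Inc)", date(2016, 5, 10), date(2016, 5, 12), date(2016, 5, 10)),
    ("Partnership (Microsoft, Facebook)", date(2016, 5, 25), date(2016, 5, 27), date(2016, 5, 26)),
    ("Partnership (Google, Fiat)", date(2016, 5, 4), date(2016, 5, 6), date(2016, 5, 3)),
]


def is_consistent(start: date, end: date, established: date, tolerance_days: int = 0) -> bool:
    """True if the establishment date lies inside the (optionally widened) peak window."""
    not_too_early = (start - established).days <= tolerance_days
    not_too_late = (established - end).days <= tolerance_days
    return not_too_early and not_too_late


exact = sum(is_consistent(s, e, d) for _, s, e, d in ROWS)
near = sum(is_consistent(s, e, d, tolerance_days=1) for _, s, e, d in ROWS)
print(f"{exact}/{len(ROWS)} inside the peak window, {near}/{len(ROWS)} within one day of it")
```

On these ten sampled instances, the only establishment date outside its peaking window is Partnership (Google, Fiat), which precedes the window by one day.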

The Ability to Eliminate the Noises. We investigate the ability of instance-popularity as a hard filter to eliminate the noise generated by the DS alignment. Although instance-popularity has been shown to be useful indirectly, as a component of Time-DS, a direct evaluation is more straightforward. Specifically, we randomly sample 100 training instances from each dataset to check the ratio of noise in each training set. Table 4 presents the ratio of noise in each training set. Note that our models trained on subset “ $\geq 0.3$ ” and subset “ $\geq 0.5$ ” achieve the best performance for relation classification and extraction. It is clear that the training sets with a lower ratio of noise tend to be those on which our model achieves better performance. This correspondence is not strict, because the scale of the training set also affects the performance.

Error Cases Study. In this study, we have two types of error cases. First, some aligned sentences with low instance-popularity are actually true relation instances. We call them low-instance-popularity but positive cases (LP). Second, some aligned sentences whose timestamps are consistent with the peak time of instance-popularity are actually false relation instances. We call them high-instance-popularity but negative cases (HN). We present some LP and HN cases as follows.

LP1. “Jose Mourinho is reportedly set to be confirmed as Manchester United’s new manager in the coming days.”, InsPo: 0.0, relation: JobChange.

LP2. “Activision Blizzard’s acquisition of Major League Gaming appears to be bearing fruit.”, InsPo: 0.053, relation: Acquisition.

LP3. “… between two giants in the technology world, as we ’ve seen repeatedly with the Apple v. Samsung litigation .”, InsPo: 0.097, relation: Lawsuit.

HN1. “LinkedIn will give Microsoft an even greater foothold in the space …”, InsPo: 0.67, relation: Acquisition.

HN2. “Google and Fiat Chrysler engineers will fit Google ’s autonomous driving technology into the Pacifica minivan.”, InsPo: 0.5, relation: Partnership.

HN3. “Warren Buffett, fondly known as the Oracle of Omaha … behind brainchild Berkshire Hathaway Inc. just upped his stake in Apple Inc. by a significant chunk.”, InsPo: 0.80, relation: Investing.

The above LP cases cover different situations. (1) Some relation instances are reported in the news before the official establishment of the relation. Therefore, the timestamps of these mentions are earlier than the peak time of instance-popularity, e.g., LP1. (2) Some relation instances are still reported in the news a long time after the official establishment of the relation. Therefore, the timestamps of these mentions are much later than the peak time of instance-popularity, e.g., LP2. (3) Some relation instances are reported by news media over a long time. In this case, instance-popularity is normally distributed over a long time period. Therefore, instance-popularity usually fails to locate the establishment of the relation within a short time interval, e.g., the six-year lawsuits between Apple Inc. and Samsung Electronics (LP3).

The above HN cases cover different situations. (1) Some aligned mentions actually talk about other aspects of the mentioned entities. For example, HN1 discusses the influence of the acquisition rather than expressing the acquisition as a relation. (2) Some cases provide incomplete information, making it hard to confirm the existence of the relation, e.g., HN2 and HN3.

5. Related Work

5.1. Improvements for Distant Supervision

To alleviate the effects of noise in the automatically annotated datasets of DS, some studies captured certain types of noise and used multi-instance learning (Riedel et al., 2010; Surdeanu et al., 2012; Ritter et al., 2013; Min et al., 2013). Some neural network methods learned attentively from multiple instances, without explicitly characterizing the inherent noise (Zeng et al., 2015; Lin et al., 2016; Feng et al., 2017). These approaches focus on enhancing the noise tolerance of models instead of reducing noise at the source, and hence still suffer from the effects of noise in some ways. Some work considered utilizing other kinds of knowledge besides the KB (Han and Sun, 2016; Liu et al., 2017) to enrich the supervision knowledge. However, such studies suffer from the conflicts brought by multiple supervisions (Ratner et al., 2016) and can hardly benefit existing relation extraction/classification models.

5.2. Relation Extraction/Classification

Relation classification aims to classify a given relation mention into a pre-defined relation type. Deep neural networks have shown promising results, and representative progress was made by Zeng et al. (2014). To encode both past and future context information, Zhang and Wang (2015) employed a bidirectional Recurrent Neural Network (Bi-RNN). To address the long-distance problem, some approaches based on Long Short-Term Memory networks (LSTM) have been proposed (Zhang et al., 2015; Xu et al., 2015). Recently, Zhou (2016) combined the attention model and bidirectional LSTM, achieving a significant improvement for relation classification.

Relation extraction can be regarded as a pipeline of two separate tasks, i.e., named entity recognition and relation classification. However, some studies consider extracting entities and relations in a single model. Most of these methods are feature-based (Ren et al., 2017; Yang and Cardie, 2013; Miwa and Sasaki, 2014; Li and Ji, 2014). Recently, Miwa and Bansal (2016) used an LSTM-based model to reduce such manual features. Zheng et al. (2017) converted relation extraction into a sequence tagging problem and proposed an LSTM-based encoder-decoder model to extract entities and relations jointly without other redundant information, leading to the best results on the public dataset.

5.3. Curriculum Learning

The main idea of curriculum learning (Bengio et al., 2009) is to start with the easiest aspect of a task and increase the difficulty gradually. Curriculum learning has mainly been applied to various problems in Computer Vision (CV), such as tracking (Supancic III and Ramanan, 2013), face detection (Lin et al., 2018), object detection (Chen and Gupta, 2015), video detection (Jiang et al., 2014), etc. Luo et al. (2017) applied curriculum learning to the task of relation classification. However, they used curriculum learning to address the cold start of their model training, on a special dataset with explicit prior knowledge of data quality, which is different from our work.

6. Conclusion

In this paper, to alleviate the noise issue in distant supervision (DS), we take a new factor, time, into consideration and propose a novel time-aware distant supervision framework (Time-DS). To make the most of time, we consider two strategies, i.e., hard filter and curriculum learning. Time-DS benefits from these two strategies, which guide the training process and yield better models for relation extraction/classification. The experimental results show the effectiveness of the time series instance-popularity and significant improvements on relation extraction/classification when models are fed into Time-DS.

Bibliography

Bengio et al. (2009) Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 41–48.
Chen and Gupta (2015) Xinlei Chen and Abhinav Gupta. 2015. Webly supervised learning of convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision. 1431–1439.
Feng et al. (2017) Xiaocheng Feng, Jiang Guo, Bing Qin, Ting Liu, and Yongjie Liu. 2017. Effective deep memory networks for distant supervised relation extraction. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI. 19–25.
Han and Sun (2016) Xianpei Han and Le Sun. 2016. Global distant supervision for relation extraction. In Thirtieth AAAI Conference on Artificial Intelligence.
Jiang et al. (2014) Lu Jiang, Deyu Meng, Shoou-I Yu, Zhenzhong Lan, Shiguang Shan, and Alexander Hauptmann. 2014. Self-paced learning with diversity. In Advances in Neural Information Processing Systems. 2078–2086.
Li and Ji (2014) Qi Li and Heng Ji. 2014. Incremental joint extraction of entity mentions and relations. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1. 402–412.
Lin et al. (2018) Liang Lin, Keze Wang, Deyu Meng, Wangmeng Zuo, and Lei Zhang. 2018. Active self-paced learning for cost-effective and progressive face identification. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 1 (2018), 7–19.

Relation Instance	Instance-Popularity	Peaking Time	Establishment Time
Acquisition (Pfizer, Anacor Pharmaceuticals)	May	May 16-18, 2016	May 16, 2016
Acquisition (NBC, DreamWorks Animation)	Apr.	Apr. 25-30, 2016	Apr. 28, 2016
Investing (Private Trust, Honeywell International)	Feb.	Feb. 25-27, 2016	Feb. 25, 2016
Investing (Jennison Associates, Boeing)	Jan.	Jan. 13-15, 2016	Jan. 13, 2016
JobChange (Louis Van Gaal, Manchester United)	May	May 22-24, 2016	May 23, 2016
JobChange (Derek Fisher, Knicks)	Feb.	Feb. 7-9, 2016	Feb. 7, 2016
Lawsuit (Huawei, Samsung Electronics)	May	May 22-27, 2016	May 25, 2016
Lawsuit (Wal-Mart Stores Inc., Visa Inc)	May	May 10-12, 2016	May 10, 2016
Partnership (Microsoft, Facebook)	May	May 25-27, 2016	May 26, 2016
Partnership (Google, Fiat)	May	May 4-6, 2016	May 3, 2016