False Positive and Cross-relation Signals in Distant Supervision Data

Anca Dumitrache; Lora Aroyo; Chris Welty

arXiv:1711.05186 · cs.CL · December 1, 2017 · 1 cites

TL;DR

This paper investigates the quality issues in distant supervision data for relation extraction, identifying false positives and relation interdependencies, and explores crowdsourcing methods to improve data quality and training.

Contribution

It introduces ambiguity-aware CrowdTruth metrics to analyze DS data quality issues and demonstrates preliminary use of crowdsourcing to enhance relation classification training data.

Findings

01 False positives vary significantly across relations.

02 Relations exhibit causal connections not captured by DS.

03 Crowdsourcing can improve training data quality.

Abstract

Distant supervision (DS) is a well-established method for relation extraction from text, based on the assumption that when a knowledge base contains a relation between a term pair, sentences that contain that pair are likely to express the relation. In this paper, we use the results of a crowdsourcing relation extraction task to identify two problems with DS data quality: the widely varying degree of false positives across different relations, and the observed causal connections between relations, which are not considered by the DS method. The crowdsourcing data aggregation is performed using ambiguity-aware CrowdTruth metrics, which are used to capture and interpret inter-annotator disagreement. We also present preliminary results of using the crowd to enhance DS training data for a relation classification model, without requiring the crowd to annotate the entire set.
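
The DS assumption described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the toy knowledge base, the example sentences, the function name `distant_supervision_labels`, and the plain substring matching are all assumptions made for illustration. The second sentence shows how a false positive arises when a sentence mentions the term pair without expressing the relation.

```python
from typing import Dict, List, Tuple

# Hypothetical toy knowledge base: (term1, term2) -> relation name.
KB: Dict[Tuple[str, str], str] = {
    ("Barack Obama", "Honolulu"): "place_of_birth",
}

# Hypothetical example sentences, not taken from the paper's data.
SENTENCES: List[str] = [
    "Barack Obama was born in Honolulu.",        # expresses the relation
    "Barack Obama visited Honolulu last week.",  # mentions the pair only
    "Honolulu is the capital of Hawaii.",        # only one term present
]


def distant_supervision_labels(
    sentences: List[str], kb: Dict[Tuple[str, str], str]
) -> List[Tuple[str, str, str, str]]:
    """Label every sentence containing both terms of a KB pair with the
    KB relation, regardless of what the sentence actually says."""
    labeled = []
    for sentence in sentences:
        for (term1, term2), relation in kb.items():
            # Naive substring match stands in for real entity linking.
            if term1 in sentence and term2 in sentence:
                labeled.append((sentence, term1, term2, relation))
    return labeled


LABELED = distant_supervision_labels(SENTENCES, KB)
for sentence, term1, term2, relation in LABELED:
    print(f"{relation}({term1}, {term2}): {sentence}")
```

Under these assumptions, both of the first two sentences receive the `place_of_birth` label, although only the first one expresses it. Crowd annotation of such sentences is one way to detect labels of the second kind.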

Code & Models

Repositories

CrowdTruth/Open-Domain-Relation-Extraction (Official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems