False Positive and Cross-relation Signals in Distant Supervision Data
Anca Dumitrache, Lora Aroyo, Chris Welty

TL;DR
This paper investigates the quality issues in distant supervision data for relation extraction, identifying false positives and relation interdependencies, and explores crowdsourcing methods to improve data quality and training.
Contribution
It applies ambiguity-aware CrowdTruth metrics to analyze DS data quality issues and presents preliminary results on using crowdsourcing to enhance the training data of a relation classification model.
Findings
The degree of false positives varies widely across relations.
Relations exhibit causal connections not captured by DS.
Crowdsourcing can improve training data quality.
Abstract
Distant supervision (DS) is a well-established method for relation extraction from text, based on the assumption that when a knowledge base contains a relation between a term pair, then sentences that contain that pair are likely to express the relation. In this paper, we use the results of a crowdsourcing relation extraction task to identify two problems with DS data quality: the widely varying degree of false positives across different relations, and the observed causal connection between relations, which is not considered by the DS method. The crowdsourcing data aggregation is performed using ambiguity-aware CrowdTruth metrics, which are used to capture and interpret inter-annotator disagreement. We also present preliminary results of using the crowd to enhance DS training data for a relation classification model, without requiring the crowd to annotate the entire set.
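To make the DS assumption and the false-positive problem concrete, the sketch below labels sentences by term-pair lookup and then scores a DS label against crowd votes. It is only an illustration under stated assumptions: the toy knowledge base, the sentences, the worker votes, and the function names `ds_label` and `sentence_relation_score` are all hypothetical, and the score is a simplified, unweighted cosine in the spirit of CrowdTruth, not necessarily the exact metric used in the paper.

```python
import math
from collections import Counter

# Hypothetical toy knowledge base: (term1, term2) -> relation.
KB = {
    ("aspirin", "headache"): "treats",
    ("smoking", "lung cancer"): "causes",
}


def ds_label(sentence, kb):
    """Distant supervision: assign the KB relation to any sentence
    that mentions both terms of a KB pair (simple substring match)."""
    text = sentence.lower()
    return [(pair, rel) for pair, rel in kb.items()
            if pair[0] in text and pair[1] in text]


def sentence_relation_score(worker_votes, relation):
    """Cosine between the summed worker vote vector and the unit
    vector of `relation`. Each worker vote is a set of relations.
    Assumption: unweighted votes; no worker-quality weighting."""
    counts = Counter(r for votes in worker_votes for r in votes)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return counts[relation] / norm if norm else 0.0


# Two sentences that DS labels identically as "treats".
good = "Aspirin treats headache in most adults."
bad = "Some patients reported headache after taking aspirin."

# Hypothetical crowd annotations (one set of relations per worker).
votes_good = [{"treats"}, {"treats"}, {"treats"}, {"treats"}]
votes_bad = [{"side_effect"}, {"side_effect"},
             {"treats", "side_effect"}, {"none"}]

for sentence, votes in [(good, votes_good), (bad, votes_bad)]:
    for pair, rel in ds_label(sentence, KB):
        score = sentence_relation_score(votes, rel)
        print(f"{rel:7s} score={score:.2f}  {sentence}")
```

In this toy setup both sentences receive the DS label "treats", but the crowd score for the second sentence is low, which flags it as a likely false positive. Because the score is continuous rather than a majority vote, disagreement among workers is preserved as a signal instead of being discarded.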
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
