Few Clean Instances Help Denoising Distant Supervision
Yufang Liu, Ziyin Huang, Yijun Wang, Changzhi Sun, Man Lan, Yuanbin Wu, Xiaofeng Mou, Ding Wang

TL;DR
This paper demonstrates that a small clean dataset can significantly improve the robustness and evaluation of distantly supervised relation extraction models by introducing influence-based instance selection and a teacher-student bootstrapping mechanism.
Contribution
It introduces a novel influence function-based criterion for clean instance selection and a teacher-student framework to enhance denoising in distantly supervised relation extraction.
Findings
Improved denoising performance on noisy datasets
Enhanced model robustness with small clean datasets
Effective influence-based sample selection method
Abstract
Existing distantly supervised relation extractors usually rely on noisy data for both model training and evaluation, which may lead to garbage-in-garbage-out systems. To alleviate the problem, we study whether a small clean dataset could help improve the quality of distantly supervised models. We show that besides getting a more convincing evaluation of models, a small clean dataset also helps us to build more robust denoising models. Specifically, we propose a new criterion for clean instance selection based on influence functions. It collects sample-level evidence for recognizing good instances (which is more informative than loss-level evidence). We also propose a teacher-student mechanism for controlling purity of intermediate results when bootstrapping the clean set. The whole approach is model-agnostic and demonstrates strong performances on both denoising real (NYT) and synthetic…
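The abstract does not spell out the selection criterion, so the following is only a minimal sketch of the general idea, assuming the standard influence-function formulation (the effect of upweighting a training instance z on the loss over a small clean set, -g_clean^T H^{-1} g_z) applied to a toy logistic-regression classifier. The model, the synthetic data, and all function names (`fit_logreg`, `influence_on_clean_loss`, `demo`) are hypothetical and are not taken from the paper.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logreg(X, y, l2=1e-2, steps=500, lr=0.5):
    """Plain gradient descent on L2-regularized mean logistic loss."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (sigmoid(X @ w) - y) / len(y) + l2 * w
        w -= lr * grad
    return w

def influence_on_clean_loss(X_train, y_train, X_clean, y_clean, w, l2=1e-2):
    """Influence of upweighting each training instance on the mean clean-set loss.

    Negative values mean that upweighting the instance lowers the clean loss,
    so the instance agrees with the clean data (assumed selection signal).
    """
    p_train = sigmoid(X_train @ w)
    # Per-instance loss gradients, shape (n_train, d).
    g_train = (p_train - y_train)[:, None] * X_train
    # Hessian of the regularized mean training loss, shape (d, d).
    weights = p_train * (1.0 - p_train)
    H = (X_train * weights[:, None]).T @ X_train / len(y_train)
    H += l2 * np.eye(X_train.shape[1])
    # Gradient of the mean loss on the small clean set, shape (d,).
    g_clean = X_clean.T @ (sigmoid(X_clean @ w) - y_clean) / len(y_clean)
    # -g_z^T H^{-1} g_clean for every training instance z.
    return -g_train @ np.linalg.solve(H, g_clean)

def demo(seed=0, n_train=1000, n_clean=50, noise_rate=0.3):
    """Synthetic check: flip some training labels and score every instance."""
    rng = np.random.default_rng(seed)
    w_true = np.array([2.0, -1.0, 0.5, 0.0, 0.0])
    X_train = rng.normal(size=(n_train, w_true.size))
    y_true = (X_train @ w_true > 0).astype(float)
    flipped = rng.random(n_train) < noise_rate  # simulated label noise
    y_noisy = np.where(flipped, 1.0 - y_true, y_true)
    X_clean = rng.normal(size=(n_clean, w_true.size))
    y_clean = (X_clean @ w_true > 0).astype(float)

    w = fit_logreg(X_train, y_noisy)
    infl = influence_on_clean_loss(X_train, y_noisy, X_clean, y_clean, w)
    return infl, flipped

if __name__ == "__main__":
    infl, flipped = demo()
    keep = np.argsort(infl)[: len(infl) // 2]  # most helpful half
    print("noise rate overall: ", flipped.mean())
    print("noise rate selected:", flipped[keep].mean())
```

In this toy setting, instances whose labels were flipped tend to receive higher (more harmful) scores than correctly labeled ones, which is the kind of sample-level evidence the abstract contrasts with loss-level evidence. A bootstrapping loop would then add the lowest-scoring instances to the clean set and repeat; how the paper's teacher-student mechanism controls purity in that loop is not described in the text above.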
Taxonomy
Topics: Machine Learning and Data Classification · Time Series Analysis and Forecasting · Anomaly Detection Techniques and Applications
