Analysing the Noise Model Error for Realistic Noisy Label Data
Michael A. Hedderich, Dawei Zhu, Dietrich Klakow

TL;DR
This paper investigates how accurately noise models can be estimated from noisy label data, providing a theoretical analysis and a new dataset for evaluating noise estimation methods on NLP tasks.
Contribution
It offers a theoretical framework for expected noise model error and introduces NoisyNER, a realistic noisy label dataset with multiple noise patterns and clean references.
Findings
Derives the expected error of an estimated noise model.
Analyses how the noise distribution and sampling affect the estimate.
Validates the theoretical results empirically on synthetic and realistic noisy data.
Abstract
Distant and weak supervision make it possible to obtain large amounts of labeled training data quickly and cheaply, but these automatic annotations tend to contain a large number of errors. A popular technique for overcoming the negative effects of these noisy labels is noise modelling, where the underlying noise process is modelled. In this work, we study the quality of these estimated noise models from the theoretical side by deriving the expected error of the noise model. Apart from evaluating the theoretical results on commonly used synthetic noise, we also publish NoisyNER, a new noisy label dataset from the NLP domain that was obtained through a realistic distant supervision technique. It provides seven sets of labels with differing noise patterns to evaluate different noise levels on the same instances. Parallel clean labels are available, making it possible to study scenarios where a small…
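To make the idea of an estimated noise model and its error concrete, here is a minimal Python sketch. It is not taken from the paper: it assumes a common formulation in which the noise model is a row-stochastic matrix of P(noisy label | clean label), estimated by counting over instances that carry both a clean and a noisy label. The uniform-flip noise, the squared-error measure, and all function names are illustrative assumptions.

```python
import numpy as np


def uniform_noise_matrix(num_classes, noise_level):
    """Assumed synthetic noise: keep the label with probability
    1 - noise_level, otherwise flip uniformly to one of the other classes."""
    off_diag = noise_level / (num_classes - 1)
    matrix = np.full((num_classes, num_classes), off_diag)
    np.fill_diagonal(matrix, 1.0 - noise_level)
    return matrix


def corrupt_labels(clean, noise_matrix, rng):
    """Sample one noisy label per clean label from the matching matrix row."""
    num_classes = noise_matrix.shape[0]
    return np.array([rng.choice(num_classes, p=noise_matrix[y]) for y in clean])


def estimate_noise_matrix(clean, noisy, num_classes):
    """Count (clean, noisy) pairs and normalise each row to probabilities.
    Rows without any clean instance fall back to a uniform distribution."""
    clean = np.asarray(clean, dtype=int)
    noisy = np.asarray(noisy, dtype=int)
    counts = np.zeros((num_classes, num_classes))
    np.add.at(counts, (clean, noisy), 1)
    row_sums = counts.sum(axis=1, keepdims=True)
    fallback = np.full_like(counts, 1.0 / num_classes)
    return np.divide(counts, row_sums, out=fallback, where=row_sums > 0)


def squared_error(true_matrix, estimated_matrix):
    """Sum of squared entry-wise differences (one possible error measure)."""
    return float(np.sum((true_matrix - estimated_matrix) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_matrix = uniform_noise_matrix(num_classes=3, noise_level=0.3)
    # Compare a small and a large set of clean/noisy label pairs.
    for sample_size in (50, 5000):
        clean = rng.integers(0, 3, size=sample_size)
        noisy = corrupt_labels(clean, true_matrix, rng)
        estimate = estimate_noise_matrix(clean, noisy, num_classes=3)
        print(sample_size, squared_error(true_matrix, estimate))
```

Running the script prints the estimation error for a small and a large sample of label pairs, which gives a rough empirical counterpart to the question the abstract raises: how good is a noise model estimated from limited clean data?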
