Empirical Error Modeling Improves Robustness of Noisy Neural Sequence Labeling

Marcin Namysl; Sven Behnke; Joachim Köhler

arXiv:2105.11872·cs.CL·May 26, 2021

TL;DR

This paper introduces an empirical error modeling approach using a sequence-to-sequence model to generate realistic noisy data, significantly enhancing the robustness of neural sequence labeling systems against OCR and user-generated text noise.

Contribution

The paper proposes a novel empirical error generation method and noisy language model embeddings to improve noise robustness in sequence labeling, supported by large-scale training data and benchmarks.

Findings

01. Outperforms baseline noise generation and error correction techniques

02. Achieves better accuracy on noisy sequence labeling datasets

03. Provides publicly available code and data for future research

Abstract

Despite recent advances, standard sequence labeling systems often fail when processing noisy user-generated text or consuming the output of an Optical Character Recognition (OCR) process. In this paper, we improve the noise-aware training method by proposing an empirical error generation approach that employs a sequence-to-sequence model trained to perform translation from error-free to erroneous text. Using an OCR engine, we generated a large parallel text corpus for training and produced several real-world noisy sequence labeling benchmarks for evaluation. Moreover, to overcome the data sparsity problem that is exacerbated in the case of imperfect textual input, we learned noisy language model-based embeddings. Our approach outperformed the baseline noise generation and error correction techniques on the erroneous sequence labeling data sets. To facilitate future research on robustness,…
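
To make the idea of learning errors from a parallel corpus concrete, here is a minimal sketch in Python. It is not the sequence-to-sequence model the abstract describes, and it is not taken from the authors' repository. It is a much simpler stand-in that counts character substitutions seen in (clean, noisy) sentence pairs and then samples from those counts to corrupt new text. The function names `estimate_substitutions` and `corrupt` and the toy pair are hypothetical, chosen only for illustration.

```python
import random
from collections import Counter, defaultdict
from difflib import SequenceMatcher


def estimate_substitutions(pairs):
    """Count character substitutions observed in (clean, noisy) text pairs.

    Assumption: only equal-length replacements are counted; insertions,
    deletions, and unequal-length replacements are ignored in this sketch.
    """
    counts = defaultdict(Counter)
    for clean, noisy in pairs:
        matcher = SequenceMatcher(a=clean, b=noisy, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                # Unchanged characters count as mapping to themselves.
                for ch in clean[i1:i2]:
                    counts[ch][ch] += 1
            elif tag == "replace" and (i2 - i1) == (j2 - j1):
                for src, dst in zip(clean[i1:i2], noisy[j1:j2]):
                    counts[src][dst] += 1
    return counts


def corrupt(text, counts, rng):
    """Replace each character by sampling from its observed substitutions."""
    out = []
    for ch in text:
        options = counts.get(ch)
        if not options:
            # Characters never seen in the corpus are left unchanged.
            out.append(ch)
            continue
        targets = list(options.keys())
        weights = list(options.values())
        out.append(rng.choices(targets, weights=weights, k=1)[0])
    return "".join(out)


if __name__ == "__main__":
    # Toy parallel corpus: a clean string and a hypothetical OCR reading of it.
    toy_pairs = [("hello", "he11o")]
    table = estimate_substitutions(toy_pairs)
    print(corrupt("hello", table, random.Random(0)))
```

A table like this treats every character independently, so it cannot capture context-dependent errors or errors that change the length of the text. A model trained to translate whole sequences from error-free to erroneous text, as the abstract proposes, is not bound by that restriction.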

Code & Models

Repositories

mnamysl/nat-acl2021 (PyTorch, official)
