TL;DR
This paper introduces an empirical error generation approach that uses a sequence-to-sequence model to translate error-free text into realistic noisy text, improving the robustness of neural sequence labeling systems to OCR errors and noisy user-generated text.
Contribution
The paper proposes an empirical error generation method and noisy language model-based embeddings to improve noise robustness in sequence labeling, supported by a large OCR-generated parallel corpus for training and several real-world noisy benchmarks for evaluation.
Findings
Outperforms baseline noise generation and error correction techniques
Improves results on real-world noisy sequence labeling benchmarks produced with an OCR engine
Provides publicly available code and data for future research
Abstract
Despite recent advances, standard sequence labeling systems often fail when processing noisy user-generated text or consuming the output of an Optical Character Recognition (OCR) process. In this paper, we improve the noise-aware training method by proposing an empirical error generation approach that employs a sequence-to-sequence model trained to perform translation from error-free to erroneous text. Using an OCR engine, we generated a large parallel text corpus for training and produced several real-world noisy sequence labeling benchmarks for evaluation. Moreover, to overcome the data sparsity problem, which is exacerbated by imperfect textual input, we learned noisy language model-based embeddings. Our approach outperformed the baseline noise generation and error correction techniques on the erroneous sequence labeling data sets. To facilitate future research on robustness,…
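The idea of learning noise from a parallel corpus of error-free and erroneous text can be illustrated with a much simpler stand-in than the paper's sequence-to-sequence model. The sketch below is not the authors' method: it estimates per-character substitution counts from equal-length (clean, noisy) pairs and samples from them to corrupt tokens, leaving the label sequence aligned. The function names `estimate_confusions` and `corrupt` and the toy pairs are hypothetical, chosen only for illustration.

```python
import random
from collections import Counter, defaultdict


def estimate_confusions(pairs):
    """Count character substitutions observed in (clean, noisy) string pairs."""
    counts = defaultdict(Counter)
    for clean, noisy in pairs:
        if len(clean) != len(noisy):
            # This sketch models substitutions only; pairs that would need
            # insertions or deletions (i.e. an alignment step) are skipped.
            continue
        for c, n in zip(clean, noisy):
            counts[c][n] += 1
    return counts


def corrupt(tokens, counts, rng):
    """Return a noisy copy of each token, sampled from the observed substitutions."""
    noisy_tokens = []
    for tok in tokens:
        chars = []
        for c in tok:
            dist = counts.get(c)
            if dist:
                options, weights = zip(*dist.items())
                chars.append(rng.choices(options, weights=weights, k=1)[0])
            else:
                chars.append(c)  # character never seen in the corpus: keep it
        noisy_tokens.append("".join(chars))
    return noisy_tokens


# Toy parallel corpus (invented): clean text and an OCR-like corrupted copy.
pairs = [("Illinois", "I11inois"), ("hello", "hello"), ("total", "tota1")]
counts = estimate_confusions(pairs)

tokens = ["Hello", "Illinois"]
labels = ["O", "B-LOC"]
noisy = corrupt(tokens, counts, random.Random(0))

# The token count is unchanged, so the original labels can be reused as-is.
print(list(zip(noisy, labels)))
```

Because corruption happens inside tokens and never merges or splits them, each noisy sentence can be paired with the original labels and added to the training data. A sequence-to-sequence generator, as described in the abstract, could additionally produce insertions, deletions, and context-dependent errors that this per-character sketch cannot.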
