SubRegWeigh: Effective and Efficient Annotation Weighing with Subword Regularization

Kohei Tsuji; Tatsuya Hiraoka; Yuchang Cheng; Tomoya Iwakura

arXiv:2409.06216 · cs.CL · February 4, 2025

TL;DR

SubRegWeigh introduces a fast, subword regularization-based method for annotation error detection and weighting, significantly reducing computation time while improving performance in NLP tasks like document classification and NER.

Contribution

It presents an annotation weighting method that uses subword regularization in place of many separately trained error detection models, running four to five times faster than the existing method while improving task performance.

Findings

01 Performs annotation weighting four to five times faster than the existing method.

02 Improves performance in document classification and named entity recognition.

03 Clearly identifies pseudo-incorrect labels as annotation errors.

Abstract

NLP datasets may still contain annotation errors, even when they are manually annotated. Researchers have attempted to develop methods to automatically reduce the adverse effect of errors in datasets. However, existing methods are time-consuming because they require many trained models to detect errors. This paper proposes a time-saving method that utilizes a tokenization technique called subword regularization to simulate multiple error detection models for detecting errors. Our proposed method, SubRegWeigh, can perform annotation weighting four to five times faster than the existing method. Additionally, SubRegWeigh improved performance in document classification and named entity recognition tasks. In experiments with pseudo-incorrect labels, SubRegWeigh clearly identifies pseudo-incorrect labels as annotation errors. Our code is available at https://github.com/4ldk/SubRegWeigh .
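
The abstract does not spell out the weighting rule, so the following is only a minimal sketch of one plausible reading: a single classifier is queried on several randomly sampled tokenizations of the same input, and the fraction of predictions that agree with the annotated label is used as that example's weight. Everything named here (`sample_segmentation`, `tokenize`, `toy_predict`, `annotation_weight`, and the parameters `k` and `split_prob`) is hypothetical and not taken from the paper or its repository. The random splitter is a toy stand-in for a real subword regularization tokenizer.

```python
import random
from typing import Callable, List


def sample_segmentation(word: str, rng: random.Random, split_prob: float = 0.3) -> List[str]:
    """Toy stand-in for subword regularization: split a word at random positions."""
    pieces, start = [], 0
    for i in range(1, len(word)):
        if rng.random() < split_prob:
            pieces.append(word[start:i])
            start = i
    pieces.append(word[start:])
    return pieces


def tokenize(text: str, rng: random.Random, split_prob: float = 0.3) -> List[str]:
    """Sample one random tokenization of a whitespace-separated text."""
    tokens: List[str] = []
    for word in text.split():
        tokens.extend(sample_segmentation(word, rng, split_prob))
    return tokens


def toy_predict(tokens: List[str]) -> str:
    """Hypothetical classifier that is sensitive to how the input was segmented."""
    return "pos" if "good" in tokens else "neg"


def annotation_weight(
    text: str,
    gold_label: str,
    predict: Callable[[List[str]], str],
    k: int = 8,
    seed: int = 0,
) -> float:
    """Assumed rule: weight = share of k sampled tokenizations whose prediction
    matches the annotated label. A low weight flags a possible annotation error."""
    rng = random.Random(seed)
    agree = sum(predict(tokenize(text, rng)) == gold_label for _ in range(k))
    return agree / k


if __name__ == "__main__":
    # The same model is reused for all k tokenizations; no extra models are trained.
    print(annotation_weight("a good movie", "pos", toy_predict, k=16))
    print(annotation_weight("a good movie", "neg", toy_predict, k=16))
```

Under this reading, the time saving comes from reusing one trained model across many tokenizations instead of training a separate model for each vote. The resulting weights could then scale each example's loss when the final model is trained.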

Code & Models

Repositories

4ldk/SubRegWeigh (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications