Bootstrapping a Tagged Corpus through Combination of Existing Heterogeneous Taggers
Jakub Zavrel, Walter Daelemans

TL;DR
This paper introduces Combi-bootstrap, a method that feeds the output of existing heterogeneous taggers and lexical resources into a second-level learner, so that a corpus can be annotated with a new tagset from only a very small hand-annotated sample.
Contribution
The paper presents Combi-bootstrap, an approach that treats the predictions of existing taggers and lexical resources as features and learns the mapping to a new tagset from a very small annotated sample.
Findings
Achieves up to 44.7% error reduction relative to both the best single tagger and an ensemble tagger built from the same small training sample.
Integrates a wide variety of existing taggers and lexical resources.
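The error-reduction figure above is a relative measure: the share of the baseline's errors that the new system removes. A minimal sketch of that arithmetic follows; the function name and the accuracy values are illustrative assumptions, not numbers reported in the paper.

```python
def relative_error_reduction(baseline_accuracy: float, new_accuracy: float) -> float:
    """Fraction of the baseline's errors that the new system eliminates."""
    baseline_error = 1.0 - baseline_accuracy
    new_error = 1.0 - new_accuracy
    if baseline_error <= 0.0:
        raise ValueError("baseline has no errors to reduce")
    return (baseline_error - new_error) / baseline_error


# Hypothetical accuracies, chosen only to illustrate the formula:
# a baseline at 90% and a combined system at 94%.
reduction = relative_error_reduction(0.90, 0.94)
print(f"{reduction:.1%}")
```

Under this definition, a large relative reduction can correspond to a modest absolute gain in accuracy when the baseline is already strong.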
Abstract
This paper describes a new method, Combi-bootstrap, to exploit existing taggers and lexical resources for the annotation of corpora with new tagsets. Combi-bootstrap uses existing resources as features for a second-level machine learning module that is trained to make the mapping to the new tagset on a very small sample of annotated corpus material. Experiments show that Combi-bootstrap: i) can integrate a wide variety of existing resources, and ii) achieves much higher accuracy (up to 44.7% error reduction) than both the best single tagger and an ensemble tagger constructed out of the same small training sample.
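The idea in the abstract, existing resources as features for a second-level learner, can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the nearest-neighbor learner, the class and function names, the two "existing taggers", the lexicon column, and all tags are hypothetical stand-ins.

```python
from collections import Counter


def overlap_distance(a, b):
    """Number of feature positions at which two rows disagree."""
    return sum(1 for x, y in zip(a, b) if x != y)


class NearestNeighborCombiner:
    """Second-level learner (assumed here to be k-nearest-neighbor) that maps
    the outputs of existing resources onto a tag from the new tagset."""

    def __init__(self, k=1):
        self.k = k
        self.examples = []

    def fit(self, feature_rows, new_tags):
        # Store the small hand-annotated sample as (features, new tag) pairs.
        self.examples = list(zip(feature_rows, new_tags))
        return self

    def predict_one(self, features):
        # Rank stored examples by distance and vote among the k closest.
        ranked = sorted(self.examples,
                        key=lambda ex: overlap_distance(ex[0], features))
        votes = Counter(tag for _, tag in ranked[: self.k])
        return votes.most_common(1)[0][0]


# Each row holds what the existing resources say about one token:
# (hypothetical tagger A output, hypothetical tagger B output, lexicon class).
train_features = [
    ("NN", "noun", "N"),
    ("VB", "verb", "V"),
    ("DT", "det", "D"),
    ("NN", "verb", "V"),  # the existing taggers disagree on this token
    ("JJ", "adj", "A"),
]
# Hand-annotated tags in the new (hypothetical) tagset for the same tokens.
train_tags = ["NOUN", "VERB", "DET", "VERB", "ADJ"]

combiner = NearestNeighborCombiner(k=1).fit(train_features, train_tags)
print(combiner.predict_one(("VB", "verb", "N")))
```

Because the learner sees only the outputs of the existing resources, none of those resources needs to share the new tagset; only the small annotated sample does.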
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Algorithms and Data Compression
