Named Entity Recognition for Partially Annotated Datasets
Michael Strobl, Amine Trabelsi, Osmar Zaiane

TL;DR
This paper compares training strategies for partially annotated datasets in Named Entity Recognition and proposes a method to generate new datasets from Wikipedia, validated by manual annotation of food and drug entities.
Contribution
It introduces and evaluates three training strategies for partially annotated data and presents a novel approach to create new datasets from Wikipedia for NER.
Findings
Three training strategies compared for partially annotated data
Proposed method to generate datasets from Wikipedia
Manual annotation confirms dataset quality
Abstract
The most common Named Entity Recognizers are sequence taggers trained on fully annotated corpora, i.e., the class of every word of every entity is known. Partially annotated corpora, in which some but not all entities of some types are annotated, are too noisy for training sequence taggers: the same entity may be annotated with its true type in one place but left unannotated in another, which misleads the tagger. We therefore compare three training strategies for partially annotated datasets and present an approach to derive new datasets for new classes of entities from Wikipedia without time-consuming manual data annotation. To verify that our data acquisition and training approaches are plausible, we manually annotated test datasets for two new classes, namely food and drugs.
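The noise problem described in the abstract can be illustrated with a small sketch. It is not taken from the paper and is not claimed to be one of its three strategies; it only shows one common idea, leaving unannotated tokens out of the loss instead of treating them as negatives. The `UNKNOWN` marker, the `masked_nll` helper, the `DRUG` and `O` class names, and the probabilities are all hypothetical values chosen for illustration.

```python
import math

UNKNOWN = None  # hypothetical marker: token was not annotated, true class unknown


def masked_nll(probs, labels):
    """Average negative log-likelihood over annotated tokens only.

    probs  -- one dict per token mapping class name -> predicted probability
    labels -- one class name per token, or UNKNOWN where nothing was annotated
    """
    losses = [-math.log(p[y]) for p, y in zip(probs, labels) if y is not UNKNOWN]
    return sum(losses) / len(losses) if losses else 0.0


# Two mentions of the same drug name; only the first one was annotated.
# The (made-up) model predicts DRUG with probability 0.9 for both mentions.
probs = [{"DRUG": 0.9, "O": 0.1}, {"DRUG": 0.9, "O": 0.1}]

naive_labels = ["DRUG", "O"]        # unannotated mention treated as a negative
masked_labels = ["DRUG", UNKNOWN]   # unannotated mention excluded from the loss

naive = masked_nll(probs, naive_labels)
masked = masked_nll(probs, masked_labels)
print(f"naive loss:  {naive:.3f}")
print(f"masked loss: {masked:.3f}")
```

With naive labels, the model is penalized for a prediction that is probably correct on the second mention; with masking, only the annotated mention contributes to the loss.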
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Wikis in Education and Collaboration
