Named Entity Recognition for Partially Annotated Datasets

Michael Strobl; Amine Trabelsi; Osmar Zaiane

arXiv:2204.09081·cs.CL·April 21, 2022

TL;DR

This paper compares training strategies for partially annotated datasets in Named Entity Recognition and proposes a method to generate new datasets from Wikipedia, validated by manual annotation of food and drug entities.

Contribution

It introduces and evaluates three training strategies for partially annotated data and presents a novel approach to create new datasets from Wikipedia for NER.

Findings

01 Three training strategies compared for partially annotated data

02 Proposed method to generate datasets from Wikipedia

03 Manual annotation confirms dataset quality

Abstract

The most common Named Entity Recognizers are sequence taggers trained on fully annotated corpora, i.e. corpora in which the class of every word is known for all entities. Partially annotated corpora, i.e. corpora in which some but not all entities of some types are annotated, are too noisy for training sequence taggers: the same entity may be annotated with its true type in one place but left unannotated in another, which misleads the tagger. We therefore compare three training strategies for partially annotated datasets, together with an approach for deriving new datasets for new classes of entities from Wikipedia without time-consuming manual data annotation. To verify that our data acquisition and training approaches are plausible, we manually annotated test datasets for two new classes, namely food and drugs.
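
The noise problem described in the abstract can be illustrated with a minimal sketch. The abstract does not name the three training strategies, so the code below is not one of the paper's methods; it shows one generic idea (skipping tokens without a reliable label when computing the loss) on toy data. The label set, the `UNKNOWN` marker, the helper names, and the probabilities are all hypothetical, and the sketch assumes for simplicity that we know which tokens lack a reliable label.

```python
import math

# Hypothetical label set and marker; neither is taken from the paper.
LABELS = ["O", "B-FOOD", "I-FOOD"]
UNKNOWN = None  # token whose true label was never annotated


def masked_nll(probs, gold):
    """Average negative log-likelihood over annotated tokens only.

    probs: per-token dicts mapping label -> predicted probability.
    gold:  per-token gold labels, or UNKNOWN where no annotation exists.
    """
    losses = [-math.log(p[g]) for p, g in zip(probs, gold) if g is not UNKNOWN]
    return sum(losses) / len(losses) if losses else 0.0


def toy_probs(token):
    """Toy model (assumption): predicts B-FOOD for any mention of 'pizza'."""
    if token.lower() == "pizza":
        return {"O": 0.1, "B-FOOD": 0.8, "I-FOOD": 0.1}
    return {"O": 0.8, "B-FOOD": 0.1, "I-FOOD": 0.1}


# "pizza" is annotated in the first sentence but not in the second.
tokens = ["I", "ate", "pizza", ".", "Pizza", "is", "great"]
partial = ["O", "O", "B-FOOD", "O", UNKNOWN, "O", "O"]

# Naive treatment: every unannotated token is assumed to be outside any entity.
naive = [g if g is not UNKNOWN else "O" for g in partial]

probs = [toy_probs(t) for t in tokens]
loss_masked = masked_nll(probs, partial)  # the unannotated mention is skipped
loss_naive = masked_nll(probs, naive)     # the correct prediction is penalized
```

In this toy setup `loss_naive` is larger than `loss_masked`, because the naive labels punish the model for correctly tagging the second mention of the entity. This is the misleading signal the abstract refers to.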

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Wikis in Education and Collaboration