MultiCoNER v2: a Large Multilingual dataset for Fine-grained and Noisy Named Entity Recognition

Besnik Fetahu; Zhiyu Chen; Sudipta Kar; Oleg Rokhlenko; Shervin Malmasi

arXiv:2310.13213·cs.CL·October 23, 2023·1 cites

Open Access

TL;DR

MULTICONER V2 introduces a comprehensive multilingual dataset for fine-grained NER, addressing challenges of complex entity types and noisy data, and highlights the difficulties in achieving high performance in such settings.

Contribution

MULTICONER V2 provides a large, multilingual, fine-grained NER dataset that includes noisy text, enabling better evaluation and development of robust NER models.

Findings

01 Macro-F1 score of 0.63 across languages

02 Entity noise reduces performance by 9%

03 Fine-grained taxonomy is inherently challenging

Abstract

We present MULTICONER V2, a dataset for fine-grained Named Entity Recognition covering 33 entity classes across 12 languages, in both monolingual and multilingual settings. This dataset aims to tackle the following practical challenges in NER: (i) effective handling of fine-grained classes that include complex entities like movie titles, and (ii) performance degradation due to noise generated from typing mistakes or OCR errors. The dataset is compiled from open resources like Wikipedia and Wikidata, and is publicly available. Evaluation based on the XLM-RoBERTa baseline highlights the unique challenges posed by MULTICONER V2: (i) the fine-grained taxonomy is challenging, where the scores are low with macro-F1=0.63 (across all languages), and (ii) the corruption strategy significantly impairs performance, with entity corruption resulting in 9% lower performance relative to non-entity…
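The macro-F1 figure quoted above is an unweighted average of per-class F1 scores, so rare fine-grained classes count as much as frequent ones. The following is a minimal sketch of one common way to compute it over entity spans, not the paper's evaluation script. The function name `macro_f1`, the `(start, end, label)` span representation, exact-match scoring, and the class labels in the usage note are all assumptions made for illustration.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over exact-match entity spans.

    gold, pred: lists (one item per sentence) of sets of
    (start, end, label) tuples. This representation is an assumption
    for illustration, not the dataset's native format.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold_spans, pred_spans in zip(gold, pred):
        for span in pred_spans:
            if span in gold_spans:
                tp[span[2]] += 1  # boundaries and label both match
            else:
                fp[span[2]] += 1  # predicted span has no gold match
        for span in gold_spans - pred_spans:
            fn[span[2]] += 1      # gold span that was missed

    # Every class seen in either gold or predictions contributes equally.
    labels = set(tp) | set(fp) | set(fn)
    scores = []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        scores.append(2 * precision * recall / pr_sum if pr_sum else 0.0)
    return sum(scores) / len(scores) if scores else 0.0
```

As a usage example with illustrative labels: for `gold = [{(0, 2, "VisualWork"), (4, 5, "Artist")}]` and `pred = [{(0, 2, "VisualWork"), (4, 5, "Politician")}]`, one class is fully correct and two classes score zero, so a single mislabeled span drags the average down sharply. That sensitivity is one reason a fine-grained taxonomy tends to yield lower macro-F1 than a coarse one.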

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text and Document Classification Technologies