MultiCoNER: A Large-scale Multilingual dataset for Complex Named Entity Recognition

Shervin Malmasi; Anjie Fang; Besnik Fetahu; Sudipta Kar; Oleg Rokhlenko

arXiv:2208.14536·cs.CL·September 1, 2022·82 cites

TL;DR

MultiCoNER introduces a large multilingual dataset for Named Entity Recognition, covering 11 languages, 3 domains (Wiki sentences, questions, and search queries), and challenging scenarios such as low-context text and complex entities, to facilitate the development of more robust NER systems.

Contribution

The paper provides a large-scale, multilingual NER dataset with challenging scenarios and evaluates two models on it, a baseline XLM-RoBERTa and the gazetteer-enhanced GEMNET, highlighting the limitations of current systems.

Findings

01 The baseline XLM-RoBERTa model achieves 54% macro-F1 on the dataset.

02 GEMNET, which leverages gazetteers, improves macro-F1 by 30% over the baseline.

03 The dataset exposes challenges that large pre-trained models still face in NER.
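The findings above are reported as macro-F1, which averages per-class F1 scores so that rare entity classes count as much as frequent ones. The sketch below shows one way to compute it over exact-match entity spans. It is an illustration only, not the paper's evaluation script: the function name `macro_f1`, the `(start, end, label)` span representation, and the label names used in the usage note are assumptions made for this example.

```python
from collections import Counter


def macro_f1(gold, pred, labels):
    """Macro-averaged F1 over entity classes, using exact span matching.

    gold, pred: one set of (start, end, label) tuples per sentence
    (a hypothetical representation chosen for this sketch).
    labels: the entity classes to average over.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold_spans, pred_spans in zip(gold, pred):
        for span in pred_spans:
            if span in gold_spans:
                tp[span[2]] += 1  # predicted span matches a gold span exactly
            else:
                fp[span[2]] += 1  # predicted span has no exact gold match
        for span in gold_spans:
            if span not in pred_spans:
                fn[span[2]] += 1  # gold span was missed

    f1_scores = []
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)

    # Unweighted mean: every class contributes equally, however rare it is.
    return sum(f1_scores) / len(f1_scores)
```

For example, with one sentence where a person span is predicted correctly but a creative-work span has the wrong boundary, `macro_f1([{(0, 2, "PER"), (3, 5, "CW")}], [{(0, 2, "PER"), (3, 4, "CW")}], ["PER", "CW"])` averages a perfect score for `PER` with a zero for `CW`.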

Abstract

We present MultiCoNER, a large multilingual dataset for Named Entity Recognition that covers 3 domains (Wiki sentences, questions, and search queries) across 11 languages, as well as multilingual and code-mixing subsets. This dataset is designed to represent contemporary challenges in NER, including low-context scenarios (short and uncased text), syntactically complex entities like movie titles, and long-tail entity distributions. The 26M token dataset is compiled from public resources using techniques such as heuristic-based sentence sampling, template extraction and slotting, and machine translation. We applied two NER models on our dataset: a baseline XLM-RoBERTa model, and a state-of-the-art GEMNET model that leverages gazetteers. The baseline achieves moderate performance (macro-F1=54%), highlighting the difficulty of our data. GEMNET, which uses gazetteers, improvement…
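The abstract mentions template extraction and slotting as one of the techniques used to build the dataset. As a rough illustration of the slotting idea only, the sketch below fills a placeholder in a question template with an entity and emits BIO tags for the result. The `<SLOT>` placeholder, the function name `slot_template`, and the example template and label are all hypothetical; the paper's actual pipeline is not described on this page.

```python
def slot_template(template, slot_label, entity):
    """Fill each <SLOT> token in a template with an entity and tag it in BIO format.

    template: whitespace-tokenised text containing the hypothetical <SLOT> marker.
    slot_label: entity class to assign to the slotted tokens.
    entity: the entity surface form to insert.
    """
    entity_tokens = entity.split()
    if not entity_tokens:
        raise ValueError("entity must contain at least one token")

    tokens, tags = [], []
    for token in template.split():
        if token == "<SLOT>":
            tokens.extend(entity_tokens)
            # First entity token gets B-, the remaining ones get I-.
            tags.append("B-" + slot_label)
            tags.extend(["I-" + slot_label] * (len(entity_tokens) - 1))
        else:
            tokens.append(token)
            tags.append("O")  # template words are outside any entity
    return tokens, tags


tokens, tags = slot_template("who directed <SLOT>", "CW", "to kill a mockingbird")
```

Slotting a multi-word, lowercased title like this produces the kind of short, uncased, low-context input with a syntactically complex entity that the abstract describes.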

Code & Models

Datasets

tomaarsen/MultiCoNER
dataset · 330 dl

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification