MultiCoNER: A Large-scale Multilingual dataset for Complex Named Entity Recognition
Shervin Malmasi, Anjie Fang, Besnik Fetahu, Sudipta Kar, Oleg Rokhlenko

TL;DR
MultiCoNER introduces a comprehensive multilingual dataset for Named Entity Recognition, covering diverse languages, domains, and challenging scenarios, to facilitate the development of more robust NER systems.
Contribution
It provides a large-scale, multilingual NER dataset with challenging scenarios, and evaluates models including a gazetteer-enhanced approach, highlighting current limitations.
Findings
Baseline XLM-RoBERTa achieves 54% macro-F1 on the dataset.
GEMNET, which uses gazetteers, improves macro-F1 by 30% on average over the baseline.
The dataset exposes challenges for large pre-trained models in NER.
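The macro-F1 figures above average per-class F1 scores, so rare entity types count as much as frequent ones. The following is a minimal sketch of an entity-level macro-F1, assuming exact span matching; the function name `macro_f1`, the tuple layout, and the labels are illustrative assumptions, not the authors' evaluation script.

```python
def macro_f1(gold, pred):
    """Entity-level macro-F1 with exact span matching.

    gold, pred: sets of (sentence_id, start, end, label) tuples.
    This tuple layout is an assumption made for the sketch.
    """
    labels = {span[3] for span in gold | pred}
    scores = []
    for label in sorted(labels):
        g = {s for s in gold if s[3] == label}
        p = {s for s in pred if s[3] == label}
        tp = len(g & p)  # spans that match exactly, including the label
        precision = tp / len(p) if p else 0.0
        recall = tp / len(g) if g else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    # Unweighted mean over classes: every label counts equally.
    return sum(scores) / len(scores) if scores else 0.0


# Illustrative data only (labels are hypothetical).
gold = {(0, 0, 2, "PER"), (0, 3, 5, "LOC"), (1, 0, 1, "PER")}
pred = {(0, 0, 2, "PER"), (0, 3, 5, "PER")}
score = macro_f1(gold, pred)
```

Here PER scores F1 = 0.5 and LOC scores 0, so the macro average is 0.25 even though half of the predicted spans are correct.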
Abstract
We present MultiCoNER, a large multilingual dataset for Named Entity Recognition that covers 3 domains (Wiki sentences, questions, and search queries) across 11 languages, as well as multilingual and code-mixing subsets. This dataset is designed to represent contemporary challenges in NER, including low-context scenarios (short and uncased text), syntactically complex entities like movie titles, and long-tail entity distributions. The 26M token dataset is compiled from public resources using techniques such as heuristic-based sentence sampling, template extraction and slotting, and machine translation. We applied two NER models on our dataset: a baseline XLM-RoBERTa model, and a state-of-the-art GEMNET model that leverages gazetteers. The baseline achieves moderate performance (macro-F1=54%), highlighting the difficulty of our data. GEMNET, which uses gazetteers, improvement…
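As a rough illustration of what template slotting can look like, the sketch below fills an entity placeholder in a question template and emits token-level tags. It is not the authors' pipeline: the `slot_template` helper, the `<LABEL>` placeholder syntax, the BIO tagging scheme, and the example label are all assumptions made for the example.

```python
def slot_template(template_tokens, slot_label, entity_tokens):
    """Fill the `<slot_label>` placeholder with an entity and tag the result.

    Returns (tokens, tags), where tags follow a BIO scheme
    (an assumption; the summary does not state the tagging format).
    """
    placeholder = f"<{slot_label}>"
    tokens, tags = [], []
    for tok in template_tokens:
        if tok == placeholder:
            for i, etok in enumerate(entity_tokens):
                tokens.append(etok)
                # First entity token gets B-, the rest get I-.
                tags.append(("B-" if i == 0 else "I-") + slot_label)
        else:
            tokens.append(tok)
            tags.append("O")
    return tokens, tags


# Hypothetical template and entity, lowercased to mimic uncased queries.
tokens, tags = slot_template(
    "when was <CW> released".split(), "CW", ["the", "matrix"]
)
```

Slotting many different entities into one template yields short, low-context examples in which the surrounding words give little hint about where the entity starts or ends.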
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
