FiNERweb: Datasets and Artifacts for Scalable Multilingual Named Entity Recognition
Jonas Golde, Patrick Haller, Alan Akbik

TL;DR
FiNERweb is a large-scale, multilingual NER dataset created using a scalable pipeline that leverages LLMs and regression models, enabling improved zero-shot transfer and reliable annotations across 91 languages.
Contribution
The paper introduces FiNERweb, a systematic dataset-creation pipeline for multilingual NER that covers 91 languages, produces high-quality annotations, and is released together with a comprehensive set of artifacts for research.
Findings
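The regression model used to identify NER-relevant passages achieves more than 84 F1.
Models trained on FiNERweb obtain comparable or improved zero-shot transfer performance on English, Thai, and Swahili, despite being trained on 19x less data than strong baselines.
Annotations score highly on both faithfulness and completeness.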
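The summary reports an F1 score for a regression model but does not say how continuous scores are turned into a classification metric. One common approach, assumed here purely for illustration, is to binarize both gold and predicted scores at a threshold and compute binary F1. The function names, the threshold, and the toy scores below are hypothetical and not taken from the paper.

```python
from typing import Sequence


def binary_f1(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    """Compute F1 for the positive class from two boolean sequences."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        # No true positives means precision and recall are both zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f1_from_scores(
    gold_scores: Sequence[float],
    predicted_scores: Sequence[float],
    threshold: float,
) -> float:
    """Binarize regression scores at `threshold` (an assumed cut-off) and score them."""
    y_true = [g >= threshold for g in gold_scores]
    y_pred = [p >= threshold for p in predicted_scores]
    return binary_f1(y_true, y_pred)


# Toy data: hypothetical relevance scores for five passages.
gold = [1.0, 4.0, 3.0, 0.0, 5.0]
pred = [0.8, 3.6, 2.2, 1.1, 4.9]
f1 = f1_from_scores(gold, pred, threshold=3.0)
```

With these toy values, one relevant passage falls below the threshold after prediction, so recall drops while precision stays perfect.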
Abstract
Recent multilingual named entity recognition (NER) work has shown that large language models (LLMs) can provide effective synthetic supervision, yet such datasets have mostly appeared as by-products of broader experiments rather than as systematic, reusable resources. We introduce FiNERweb, a dataset-creation pipeline that scales the teacher-student paradigm to 91 languages and 25 scripts. Building on FineWeb-Edu, our approach trains regression models to identify NER-relevant passages and annotates them with multilingual LLMs, resulting in about 225k passages with 235k distinct entity labels. Our experiments show that the regression model achieves more than 84 F1, and that models trained on FiNERweb obtain comparable or improved performance in zero-shot transfer settings on English, Thai, and Swahili, despite being trained on 19x less data than strong baselines. In addition, we assess…
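The two-stage idea in the abstract (score passages for NER relevance, then annotate the ones that pass) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `build_dataset`, `toy_score`, `toy_annotate`, the threshold, and the "MISC" label are all hypothetical stand-ins for the trained regression model and the multilingual LLM annotator.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

Entity = Tuple[str, str]  # (surface text, label)


@dataclass
class AnnotatedPassage:
    text: str
    score: float
    entities: List[Entity]


def build_dataset(
    passages: Iterable[str],
    score_fn: Callable[[str], float],
    annotate_fn: Callable[[str], List[Entity]],
    threshold: float,
) -> List[AnnotatedPassage]:
    """Keep passages whose relevance score reaches `threshold`, then annotate them."""
    dataset = []
    for text in passages:
        score = score_fn(text)
        if score < threshold:
            continue  # Skip passages judged unlikely to contain entities.
        dataset.append(AnnotatedPassage(text, score, annotate_fn(text)))
    return dataset


def toy_score(text: str) -> float:
    """Hypothetical stand-in for the regression model: share of capitalized tokens."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(tok[:1].isupper() for tok in tokens) / len(tokens)


def toy_annotate(text: str) -> List[Entity]:
    """Hypothetical stand-in for the LLM annotator: tag capitalized tokens."""
    return [(tok.strip(".,"), "MISC") for tok in text.split() if tok[:1].isupper()]


passages = ["Marie Curie worked in Paris.", "the weather was mild all week."]
dataset = build_dataset(passages, toy_score, toy_annotate, threshold=0.5)
```

Passing the scorer and annotator as callables keeps the filtering logic independent of whichever models are plugged in, which is one plausible way to swap annotators per language.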
Taxonomy
Topics: Topic Modeling · Text Readability and Simplification · Advanced Graph Neural Networks
