MariNER: A Dataset for Historical Brazilian Portuguese Named Entity Recognition
João Lucas Luz Lima Sarcinelli, Marina Lages Gonçalves Teixeira, Jade Bortot de Paiva, Diego Furtado Silva

TL;DR
This paper introduces MariNER, the first high-quality dataset for NER in early 20th-century Brazilian Portuguese, enabling better NLP analysis of historical texts and benchmarking of models.
Contribution
It creates and releases the first gold-standard NER dataset for historical Brazilian Portuguese, filling a significant resource gap.
Findings
State-of-the-art NER models achieve moderate performance on MariNER.
The dataset enables future research in digital humanities and NLP for historical texts.
Comparison of models highlights areas for improvement in historical NER.
Abstract
Named Entity Recognition (NER) is a fundamental Natural Language Processing (NLP) task that aims to identify and classify entity mentions in texts across different categories. While languages such as English possess a large number of high-quality resources for this task, Brazilian Portuguese still lacks in quantity of gold-standard NER datasets, especially when considering specific domains. Particularly, this paper considers the importance of NER for analyzing historical texts in the context of digital humanities. To address this gap, this work outlines the construction of MariNER: Mapeamento e Anotações de Registros hIstóricos para NER (Mapping and Annotation of Historical Records for NER), the first gold-standard dataset for early 20th-century Brazilian Portuguese, with more than 9,000 manually annotated sentences. We also assess and compare the performance of…
Taxonomy
Topics: Topic Modeling · Text Readability and Simplification · Natural Language Processing Techniques
