HistNERo: Historical Named Entity Recognition for the Romanian Language
Andrei-Marius Avram, Andreea Iuga, George-Vlad Manolache, Vlad-Cristian Matei, Răzvan-Gabriel Micliuș, Vlad-Andrei Muntean, Manuel-Petru Sorlescu, Dragoș-Andrei Șerban, Adrian-Dinu Urse, Vasile Păiș, Dumitru-Clementin Cercel

TL;DR
HistNERo is the first Romanian NER dataset built from historical newspapers. It supports named entity recognition in 19th- and 20th-century texts, and a domain adaptation technique improves results on it.
Contribution
This paper presents a novel Romanian historical NER dataset and demonstrates enhanced model performance through a new domain adaptation method.
Findings
Best model achieved a strict F1-score of 55.69%
Domain adaptation improved F1-score to 66.80%
Dataset covers more than 170 years of Romanian history (1817 to 1990)
Abstract
This work introduces HistNERo, the first Romanian corpus for Named Entity Recognition (NER) in historical newspapers. The dataset contains 323k tokens of text, spanning from the early 19th century (1817) to the late 20th century (1990). Eight native Romanian speakers annotated the dataset with five named entity types. Each sample belongs to one of four historical regions of Romania: Bessarabia, Moldavia, Transylvania, or Wallachia. We used the dataset to run several NER experiments with Romanian pre-trained language models. Our results show that the best model achieved a strict F1-score of 55.69%. By reducing the discrepancies between regions through a novel domain adaptation technique, we improved the performance on this corpus to a strict F1-score of 66.80%, an absolute gain of more than 10 percentage points.
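The abstract reports strict F1-scores without spelling out the metric. Under the usual convention, a predicted entity counts as correct only when both its span boundaries and its type match a gold entity exactly. The sketch below illustrates that convention; it is not the authors' evaluation code. The BIO tagging scheme, the PER and LOC labels, and the function names extract_entities and strict_f1 are assumptions made for illustration.

```python
from typing import List, Set, Tuple

# (start token index, end token index exclusive, entity type)
Entity = Tuple[int, int, str]


def extract_entities(tags: List[str]) -> Set[Entity]:
    """Collect entity spans from a BIO tag sequence (assumed scheme)."""
    entities: Set[Entity] = set()
    start, etype = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            entities.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return entities


def strict_f1(gold: List[str], pred: List[str]) -> float:
    """Entity-level F1 where boundaries and type must both match exactly."""
    gold_ents = extract_entities(gold)
    pred_ents = extract_entities(pred)
    true_positives = len(gold_ents & pred_ents)
    precision = true_positives / len(pred_ents) if pred_ents else 0.0
    recall = true_positives / len(gold_ents) if gold_ents else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative labels only; the corpus's actual five entity types may differ.
gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "O", "O", "B-LOC"]  # PER span is truncated, so it does not count
score = strict_f1(gold, pred)
print(f"strict F1 = {score:.2f}")
```

In this toy case the truncated PER span earns no credit, so only the LOC entity matches. On the reported numbers, the gain from domain adaptation is 66.80 - 55.69 = 11.11 percentage points.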
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
