Razmecheno: Named Entity Recognition from Digital Archive of Diaries "Prozhito"
Timofey Atnashev, Veronika Ganeeva, Roman Kazakov, Daria Matyash, Michael Sonkin, Ekaterina Voloshina, Oleg Serikov, Ekaterina Artemova

TL;DR
This paper introduces Razmecheno, a new Russian NER dataset built from diary texts written during Perestroika. It addresses the lack of historical and literary domain data, and the authors evaluate its utility with existing NER tools and models.
Contribution
The creation and release of Razmecheno, a novel annotated NER dataset from Russian diaries, filling a gap in historical, literary, and low-resource language datasets.
Findings
Existing NER tools perform variably on Razmecheno.
Fine-tuning pre-trained encoders improves NER performance on the dataset.
Razmecheno supports research in cross-lingual and low-resource NER.
Abstract
The vast majority of existing datasets for Named Entity Recognition (NER) are built primarily on news, research papers and Wikipedia, with a few exceptions created from historical and literary texts. What is more, English is the main source of data for further labelling. This paper aims to fill multiple gaps by creating a novel dataset, "Razmecheno", gathered from the Russian-language diary texts of the project "Prozhito". Our dataset is of interest for multiple research lines: literary studies of diary texts, transfer learning from other domains, and low-resource or cross-lingual named entity recognition. Razmecheno comprises 1331 sentences and 14119 tokens, sampled from diaries written during Perestroika. The annotation schema consists of five commonly used entity tags: person, characteristics, location, organisation, and facility. The labelling is carried out on the crowdsourcing…
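As an illustration of how a five-tag schema like this is typically scored, the sketch below converts BIO label sequences into entity spans and computes span-level precision, recall, and F1. The tag abbreviations (PER, CHAR, LOC, ORG, FAC), the BIO encoding, and the function names are assumptions made for this example; the text above does not state the dataset's label format or the metric the authors used.

```python
from typing import List, Set, Tuple

# Hypothetical abbreviations for the five entity types named in the abstract;
# the dataset's real label strings may differ.
TAGS = {"PER", "CHAR", "LOC", "ORG", "FAC"}

# (start token index, end token index exclusive, tag)
Span = Tuple[int, int, str]


def bio_to_spans(labels: List[str]) -> Set[Span]:
    """Collect entity spans from one sentence's BIO labels.

    An I- label that does not continue the current entity is treated
    leniently as the start of a new span.
    """
    spans: Set[Span] = set()
    start, tag = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, label in enumerate(labels + ["O"]):
        continues = label.startswith("I-") and tag == label[2:]
        if continues:
            continue
        if tag is not None:
            spans.add((start, i, tag))
        start, tag = None, None
        if label.startswith(("B-", "I-")):
            start, tag = i, label[2:]
    return spans


def span_f1(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float, float]:
    """Exact-match span precision, recall, and F1 over a list of sentences."""
    gold_spans, pred_spans = set(), set()
    for idx, (g, p) in enumerate(zip(gold, pred)):
        # Prefix each span with the sentence index so spans stay distinct.
        gold_spans |= {(idx, *s) for s in bio_to_spans(g)}
        pred_spans |= {(idx, *s) for s in bio_to_spans(p)}
    tp = len(gold_spans & pred_spans)
    precision = tp / len(pred_spans) if pred_spans else 0.0
    recall = tp / len(gold_spans) if gold_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


# Toy example: the person span matches, the second entity gets the wrong tag.
gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "B-ORG"]]
precision, recall, f1 = span_f1(gold, pred)
```

Exact-match span scoring is strict: a prediction with the right boundaries but the wrong tag counts as both a false positive and a false negative, which is why the toy example scores 0.5 on all three measures.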
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Authorship Attribution and Profiling
