Razmecheno: Named Entity Recognition from Digital Archive of Diaries "Prozhito"

Timofey Atnashev; Veronika Ganeeva; Roman Kazakov; Daria Matyash; Michael Sonkin; Ekaterina Voloshina; Oleg Serikov; Ekaterina Artemova

arXiv:2201.09997 · cs.CL · January 26, 2022 · 1 citation

Open Access

TL;DR

This paper introduces Razmecheno, a new Russian NER dataset built from diary texts written during Perestroika. It addresses the lack of NER data from historical and literary domains and is evaluated with existing NER tools and models.

Contribution

The creation and release of Razmecheno, a novel annotated NER dataset from Russian diaries, filling a gap in historical, literary, and low-resource language datasets.

Findings

01 Existing NER tools perform variably on Razmecheno.

02 Fine-tuning pre-trained encoders improves NER performance on the dataset.

03 Razmecheno supports research in cross-lingual and low-resource NER.

Abstract

The vast majority of existing datasets for Named Entity Recognition (NER) are built primarily on news, research papers and Wikipedia, with only a few exceptions created from historical and literary texts. Moreover, English is the main source of data for labelling. This paper aims to fill multiple gaps by creating a novel dataset, "Razmecheno", gathered from the Russian-language diary texts of the project "Prozhito". Our dataset is of interest for multiple lines of research: literary studies of diary texts, transfer learning from other domains, and low-resource or cross-lingual named entity recognition. Razmecheno comprises 1331 sentences and 14119 tokens, sampled from diaries written during the Perestroika. The annotation schema consists of five commonly used entity tags: person, characteristics, location, organisation, and facility. The labelling is carried out on the crowdsourcing…
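As a rough illustration of the annotation schema and corpus size stated in the abstract, the sketch below builds a token-level label inventory for the five entity types and computes the average sentence length. The BIO encoding and the upper-case label names are assumptions made for illustration; the abstract does not say which tagging scheme or label identifiers the released dataset uses.

```python
# Entity types named in the abstract. The upper-case identifiers are
# hypothetical labels, not necessarily those used in the released dataset.
ENTITY_TYPES = ["PERSON", "CHARACTERISTICS", "LOCATION", "ORGANISATION", "FACILITY"]


def bio_labels(entity_types):
    """Return a BIO label inventory: "O" plus B-/I- labels per entity type.

    BIO is an assumed encoding here; the source does not specify one.
    """
    labels = ["O"]
    for entity_type in entity_types:
        labels.append(f"B-{entity_type}")
        labels.append(f"I-{entity_type}")
    return labels


LABELS = bio_labels(ENTITY_TYPES)
LABEL_TO_ID = {label: index for index, label in enumerate(LABELS)}

# Corpus size as reported in the abstract.
NUM_SENTENCES = 1331
NUM_TOKENS = 14119
AVG_TOKENS_PER_SENTENCE = NUM_TOKENS / NUM_SENTENCES

if __name__ == "__main__":
    print(f"{len(LABELS)} labels: {LABELS}")
    print(f"average sentence length: {AVG_TOKENS_PER_SENTENCE:.1f} tokens")
```

Under these assumptions, a token classifier would predict one of 11 labels per token, and the reported counts imply short sentences of about 10.6 tokens on average.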

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Authorship Attribution and Profiling