StoryDB: Broad Multi-language Narrative Dataset
Alexey Tikhonov, Igor Samenko, Ivan P. Yamshchikov

TL;DR
StoryDB is a multi-language narrative dataset covering 42 languages, with at least 500 stories per language, designed to support research in multilingual natural language processing and the benchmarking of language models.
Contribution
The paper introduces StoryDB, a multi-language narrative dataset in which every story is indexed across languages and labeled with tags such as genre or topic, enabling cross-lingual NLP research and model evaluation.
Findings
Dataset includes 42 languages with 500+ stories each
Dataset used to benchmark three multilingual models: mDistillBERT, mBERT, and XLM-RoBERTa
Rich topical and language variation in the dataset
Abstract
This paper presents StoryDB - a broad multi-language dataset of narratives. StoryDB is a corpus of texts that includes stories in 42 different languages. Every language includes 500+ stories. Some of the languages include more than 20 000 stories. Every story is indexed across languages and labeled with tags such as a genre or a topic. The corpus shows rich topical and language variation and can serve as a resource for the study of the role of narrative in natural language processing across various languages including low resource ones. We also demonstrate how the dataset could be used to benchmark three modern multilanguage models, namely, mDistillBERT, mBERT, and XLM-RoBERTa.
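The structure the abstract describes (stories indexed across languages and labeled with tags) can be illustrated with a minimal sketch. This is not the dataset's actual schema or loader: the `Story` class, its field names, and the helper functions are hypothetical, chosen only to show how a cross-language index and a per-language story count might be represented.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Story:
    # All field names are illustrative assumptions, not StoryDB's real schema.
    story_id: str                 # identifier shared by versions of one story
    language: str                 # language code, e.g. "en"
    text: str                     # the narrative itself
    tags: tuple[str, ...] = ()    # labels such as a genre or a topic


def stories_per_language(stories):
    """Count how many stories each language contributes."""
    return Counter(s.language for s in stories)


def languages_with_at_least(stories, minimum):
    """Return the sorted languages that have at least `minimum` stories."""
    counts = stories_per_language(stories)
    return sorted(lang for lang, n in counts.items() if n >= minimum)


def parallel_versions(stories, story_id):
    """Map language -> Story for every version sharing one story_id."""
    return {s.language: s for s in stories if s.story_id == story_id}
```

With records in this shape, a call such as `languages_with_at_least(stories, 500)` would check the "500+ stories per language" property, and `parallel_versions` would retrieve the same story across languages.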
Taxonomy
Methods: mBERT
