StoryDB: Broad Multi-language Narrative Dataset

Alexey Tikhonov; Igor Samenko; Ivan P. Yamshchikov

arXiv:2109.14396·cs.CL·November 15, 2022

TL;DR

StoryDB is a multi-language narrative corpus covering 42 languages, with 500+ stories in every language and more than 20 000 in some. It is intended as a resource for multilingual natural language processing research and for benchmarking multilingual language models.

Contribution

The paper introduces StoryDB, a multi-language narrative dataset in which every story is indexed across languages and labeled with tags such as a genre or a topic, enabling cross-lingual NLP research and model evaluation.

Findings

01 The dataset covers 42 languages with 500+ stories each

02 The dataset is used to benchmark three multilingual models: mDistillBERT, mBERT, and XLM-RoBERTa

03 The corpus shows rich topical and language variation

Abstract

This paper presents StoryDB - a broad multi-language dataset of narratives. StoryDB is a corpus of texts that includes stories in 42 different languages. Every language includes 500+ stories. Some of the languages include more than 20 000 stories. Every story is indexed across languages and labeled with tags such as a genre or a topic. The corpus shows rich topical and language variation and can serve as a resource for the study of the role of narrative in natural language processing across various languages including low resource ones. We also demonstrate how the dataset could be used to benchmark three modern multilanguage models, namely, mDistillBERT, mBERT, and XLM-RoBERTa.
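
To make the structure described in the abstract concrete, here is a minimal sketch of how stories indexed across languages and labeled with tags might be represented and filtered. The `Story` record, its field names, and the helper functions are hypothetical illustrations and are not taken from the paper or from the released dataset; the real StoryDB schema may differ.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Story:
    # Hypothetical record layout; the actual StoryDB format may differ.
    story_id: str                               # identifier shared across languages
    tags: set = field(default_factory=set)      # e.g. genre or topic labels
    texts: dict = field(default_factory=dict)   # language code -> narrative text


def stories_per_language(stories):
    """Count how many stories have a text in each language."""
    counts = Counter()
    for story in stories:
        counts.update(story.texts.keys())
    return counts


def languages_with_at_least(stories, min_stories):
    """Return sorted language codes that have at least `min_stories` stories."""
    counts = stories_per_language(stories)
    return sorted(lang for lang, n in counts.items() if n >= min_stories)


def stories_with_tag(stories, tag, lang):
    """Return the stories that carry `tag` and have a text in `lang`."""
    return [s for s in stories if tag in s.tags and lang in s.texts]


# Toy data for illustration only; these are not real StoryDB entries.
toy_corpus = [
    Story("s1", {"drama"}, {"en": "An English plot.", "ru": "A Russian plot."}),
    Story("s2", {"comedy"}, {"en": "Another English plot."}),
]
```

With a corpus loaded into this shape, a threshold such as the 500 stories per language mentioned in the abstract could be checked with `languages_with_at_least(corpus, 500)`, and a per-language classification benchmark could draw its examples from `stories_with_tag`.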

Taxonomy

Methods: mBERT