Multilingual TinyStories: A Synthetic Combinatorial Corpus of Indic Children's Stories for Training Small Language Models

Deepon Halder; Angira Mukherjee

arXiv:2603.14563 · cs.CL · March 17, 2026

TL;DR

This paper introduces Multilingual TinyStories, a large synthetic dataset of children's stories in 17 Indian languages, designed to improve training of small language models for low-resource languages.

Contribution

It presents a novel hybrid curation pipeline combining native language generation and translation to create a large-scale, high-quality multilingual corpus for low-resource language modeling.

Findings

01 Compiled 132,942 stories totaling over 93.9 million tokens

02 Enables training and evaluation of small multilingual language models

03 Facilitates transfer learning in Indic languages

Abstract

The development of robust language models for low-resource languages is frequently bottlenecked by the scarcity of high-quality, coherent, and domain-appropriate training corpora. In this paper, we introduce the Multilingual TinyStories dataset, a large-scale, synthetically generated collection of children's stories encompassing 17 Indian languages. Designed specifically for the training and evaluation of Small Language Models (SLMs), the corpus provides simple, narrative-driven text strictly localized to native scripts. We detail our hybrid curation pipeline, which leverages the Sarvam-M language model and a novel combinatorial prompt engineering framework for native generation, coupled with the Google Translate API for large-scale cross-lingual expansion. Through strict programmatic filtering, we compiled 132,942 stories and over 93.9 million tokens in our release, serving as a…
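The abstract names a combinatorial prompt framework and programmatic filtering to native scripts but gives no implementation details here. The following is a minimal sketch of how such steps might look, not the authors' method. The ingredient lists, prompt wording, function names (`build_prompts`, `script_fraction`, `keep_story`), and thresholds are all hypothetical, and the script check is a simple heuristic based on Unicode character names.

```python
import itertools
import unicodedata

# Hypothetical story ingredients; the paper's actual lists are not shown here.
CHARACTERS = ["a curious rabbit", "a kind farmer"]
SETTINGS = ["a village pond", "a mango orchard"]
MORALS = ["sharing", "honesty"]


def build_prompts(language, characters, settings, morals):
    """Return one prompt per (character, setting, moral) combination."""
    prompts = []
    for character, setting, moral in itertools.product(characters, settings, morals):
        # Assumed prompt wording, for illustration only.
        prompts.append(
            f"Write a short children's story in {language} about {character} "
            f"in {setting} that teaches {moral}. Use only the native script."
        )
    return prompts


def script_fraction(text, script_prefix):
    """Fraction of alphabetic characters whose Unicode name starts with script_prefix."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    in_script = sum(
        1 for ch in letters if unicodedata.name(ch, "").startswith(script_prefix)
    )
    return in_script / len(letters)


def keep_story(text, script_prefix, min_fraction=0.95, min_chars=20):
    """Keep a story only if it is long enough and mostly in the target script.

    Both thresholds are assumed values, not taken from the paper.
    """
    return (
        len(text) >= min_chars
        and script_fraction(text, script_prefix) >= min_fraction
    )


if __name__ == "__main__":
    prompts = build_prompts("Hindi", CHARACTERS, SETTINGS, MORALS)
    print(len(prompts), "prompts, e.g.:", prompts[0])
    print(keep_story("एक छोटा खरगोश तालाब के पास रहता था।", "DEVANAGARI"))
    print(keep_story("Once upon a time there was a rabbit.", "DEVANAGARI"))
```

Taking the Cartesian product of a few small lists grows the number of distinct prompts multiplicatively, which is one plausible way to get varied stories from a single generator model. The script check is deliberately coarse: it looks only at letters, so punctuation and digits do not count against a story.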

Code & Models

Datasets

deeponh/multilingual-tinystories (dataset, 49 downloads)

Taxonomy

Topics: ICT in Developing Communities · Text Readability and Simplification · Natural Language Processing Techniques