Granary: Speech Recognition and Translation Dataset in 25 European Languages

Nithin Rao Koluguri; Monica Sekoyan; George Zelenfroynd; Sasha Meister; Shuoyang Ding; Sofia Kostandian; He Huang; Nikolay Karpov; Jagadeesh Balam; Vitaly Lavrukhin; Yifan Peng; Sara Papi; Marco Gaido; Alessio Brutti; Boris Ginsburg

arXiv:2505.13404·cs.CL·May 22, 2025

Open Access · 10 Models · 2 Datasets

TL;DR

Granary is a large-scale, open-source speech dataset for 25 European languages, enabling improved recognition and translation, especially for low-resource languages, through innovative data processing techniques.

Contribution

This work introduces the first extensive open-source speech dataset for recognition and translation in 25 European languages, with novel data augmentation and filtering methods.

Findings

01. Models trained on Granary-processed data achieve comparable performance with roughly 50% less data.

02. Higher data quality improves both speech recognition and translation accuracy.

03. The pipeline processes large volumes of data within hours.

Abstract

Multi-task and multilingual approaches benefit large models, yet speech processing for low-resource languages remains underexplored due to data scarcity. To address this, we present Granary, a large-scale collection of speech datasets for recognition and translation across 25 European languages. This is the first open-source effort at this scale for both transcription and translation. We enhance data quality using a pseudo-labeling pipeline with segmentation, two-pass inference, hallucination filtering, and punctuation restoration. We further generate translation pairs from pseudo-labeled transcriptions using EuroLLM, followed by a data filtration pipeline. Designed for efficiency, our pipeline processes vast amounts of data within hours. We assess models trained on processed data by comparing their performance on previously curated datasets for both high- and low-resource languages. Our…
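
The abstract names hallucination filtering as one stage of the pseudo-labeling pipeline but does not describe how it works. As a rough illustration only, the sketch below shows one common heuristic for flagging suspicious pseudo-labels: heavy n-gram repetition, or more text than the audio duration could plausibly contain. The function names, the segment fields (`text`, `duration`), and both thresholds are hypothetical and are not taken from the paper; the authors' actual filter may work quite differently.

```python
from collections import Counter


def repeated_ngram_ratio(text: str, n: int = 3) -> float:
    """Fraction of word n-grams that repeat an n-gram seen earlier in the text."""
    words = text.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(ngrams)


def looks_hallucinated(text: str, duration_s: float,
                       max_repeat_ratio: float = 0.3,
                       max_chars_per_s: float = 30.0) -> bool:
    """Flag a pseudo-label as suspicious. Thresholds are illustrative assumptions."""
    if duration_s <= 0:
        return True
    too_repetitive = repeated_ngram_ratio(text) > max_repeat_ratio
    too_dense = len(text) / duration_s > max_chars_per_s
    return too_repetitive or too_dense


def filter_segments(segments):
    """Keep only segments whose transcript passes the heuristic check."""
    return [s for s in segments
            if not looks_hallucinated(s["text"], s["duration"])]
```

A looping transcript such as "thank you for watching" repeated ten times would be dropped by this check, while an ordinary sentence spoken at a normal pace would be kept. In practice the thresholds would need tuning per language.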

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis