Granary: Speech Recognition and Translation Dataset in 25 European Languages
Nithin Rao Koluguri, Monica Sekoyan, George Zelenfroynd, Sasha Meister, Shuoyang Ding, Sofia Kostandian, He Huang, Nikolay Karpov, Jagadeesh Balam, Vitaly Lavrukhin, Yifan Peng, Sara Papi, Marco Gaido, Alessio Brutti, Boris Ginsburg

TL;DR
Granary is a large-scale, open-source speech dataset for 25 European languages, enabling improved recognition and translation, especially for low-resource languages, through innovative data processing techniques.
Contribution
This work introduces Granary, the first open-source speech dataset at this scale covering both recognition and translation in 25 European languages, built with a pseudo-labeling pipeline and a filtration step for generated translation pairs.
Findings
Models trained on Granary-processed data perform comparably to models trained on previously curated datasets while using about 50% less data.
Enhanced data quality improves speech recognition and translation accuracy.
The pipeline processes vast amounts of data within hours.
Abstract
Multi-task and multilingual approaches benefit large models, yet speech processing for low-resource languages remains underexplored due to data scarcity. To address this, we present Granary, a large-scale collection of speech datasets for recognition and translation across 25 European languages. This is the first open-source effort at this scale for both transcription and translation. We enhance data quality using a pseudo-labeling pipeline with segmentation, two-pass inference, hallucination filtering, and punctuation restoration. We further generate translation pairs from pseudo-labeled transcriptions using EuroLLM, followed by a data filtration pipeline. Designed for efficiency, our pipeline processes vast amounts of data within hours. We assess models trained on processed data by comparing their performance on previously curated datasets for both high- and low-resource languages. Our…
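The abstract names hallucination filtering as one stage of the pseudo-labeling pipeline but does not say which rules it applies. As an illustration only, here is a minimal sketch that assumes two common heuristics: dropping segments whose transcript repeats the same word n-gram too often, and dropping segments whose character rate is implausibly high for the audio duration. The function names, the segment dictionary layout, and both thresholds are hypothetical and are not taken from the Granary paper or its code.

```python
from collections import Counter


def max_ngram_repeats(text: str, n: int = 3) -> int:
    """Return how often the most frequent word n-gram occurs in text."""
    words = text.lower().split()
    if len(words) < n:
        return 0
    grams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return max(grams.values())


def looks_hallucinated(
    text: str,
    duration_s: float,
    max_repeats: int = 3,            # assumed threshold, not from the paper
    max_chars_per_s: float = 30.0,   # assumed threshold, not from the paper
) -> bool:
    """Flag a pseudo-labeled segment using two simple heuristics."""
    if duration_s <= 0 or not text.strip():
        return True
    # Far more characters than could plausibly be spoken in this time.
    if len(text) / duration_s > max_chars_per_s:
        return True
    # The same phrase looping, a typical ASR failure on silence or noise.
    return max_ngram_repeats(text) > max_repeats


def filter_segments(segments: list[dict]) -> list[dict]:
    """Keep only segments that pass both heuristics."""
    return [
        s for s in segments
        if not looks_hallucinated(s["text"], s["duration"])
    ]


if __name__ == "__main__":
    segments = [
        {"text": " ".join(["thank you for watching"] * 5), "duration": 6.0},
        {"text": "the committee approved the proposal", "duration": 2.5},
        {"text": "x" * 400, "duration": 1.0},
    ]
    for kept in filter_segments(segments):
        print(kept["text"])
```

In this sketch only the second segment survives: the first loops a single phrase and the third packs 400 characters into one second. A production filter would likely tune the thresholds per language and combine them with model-based signals.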
Code & Models
- 🤗 nvidia/parakeet-tdt-0.6b-v3 · model · 254k dl · ♡ 747
- 🤗 nvidia/parakeet-tdt-0.6b-v2 · model · 164k dl · ♡ 1444
- 🤗 nvidia/canary-qwen-2.5b · model · 144k dl · ♡ 404
- 🤗 nvidia/canary-1b-v2 · model · 123k dl · ♡ 371
- 🤗 manueljohnson063/canary-qwen-2.5b · model · 9 dl
- 🤗 SoSolaris/parakeet-tdt-0.6b-v3 · model · 7 dl
- 🤗 ManuelZnnmc/parakeet-tdt-0.6b-v3 · model · 1 dl
- 🤗 MadnessOverflow/parakeet-tdt-0.6b-v3-bpe-vocab · model
- 🤗 Endy2001/parakeet-tdt-0.6b-v3 · model · 3 dl
- 🤗 everyscribe/parakeet-tdt-0.6b-v3 · model · 9 dl
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis
