The Multilingual TEDx Corpus for Speech Recognition and Translation
Elizabeth Salesky, Matthew Wiesner, Jacob Bremerman, Roldano Cattoni, Matteo Negri, Marco Turchi, Douglas W. Oard, Matt Post

TL;DR
The paper introduces the Multilingual TEDx corpus: audio from TEDx talks in 8 source languages, with sentence-segmented transcripts aligned to the audio and to target-language translations. It is built to support speech recognition (ASR) and speech translation (ST) research for non-English source languages, including multilingual models.
Contribution
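It presents a new multilingual speech corpus with sentence-level alignment between audio, transcripts, and translations, released with open-sourced code for extending it to new talks and languages. The corpus creation methodology applies to more languages than previous work and produces multi-way parallel evaluation sets.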
Findings
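Provides baselines in multiple ASR and ST settings.
Includes multilingual models aimed at improving translation performance for low-resource language pairs.
Offers a corpus creation methodology, with open-sourced code, that extends to new talks and languages as they become available.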
Abstract
We present the Multilingual TEDx corpus, built to support speech recognition (ASR) and speech translation (ST) research across many non-English source languages. The corpus is a collection of audio recordings from TEDx talks in 8 source languages. We segment transcripts into sentences and align them to the source-language audio and target-language translations. The corpus is released along with open-sourced code enabling extension to new talks and languages as they become available. Our corpus creation methodology can be applied to more languages than previous work, and creates multi-way parallel evaluation sets. We provide baselines in multiple ASR and ST settings, including multilingual models to improve translation performance for low-resource language pairs.
