The Multilingual TEDx Corpus for Speech Recognition and Translation

Elizabeth Salesky; Matthew Wiesner; Jacob Bremerman; Roldano Cattoni; Matteo Negri; Marco Turchi; Douglas W. Oard; Matt Post

arXiv:2102.01757·cs.CL·June 16, 2021

TL;DR

The paper introduces the Multilingual TEDx corpus: TEDx talk audio in 8 source languages, with sentence-segmented transcripts aligned to the audio and to target-language translations, built for speech recognition (ASR) and speech translation (ST) research.

Contribution

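A new multilingual speech corpus with sentence-level alignment between audio, transcripts, and translations, released with open-source code for extending it to new talks and languages. The creation methodology applies to more languages than previous work and yields multi-way parallel evaluation sets.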

Findings

01

Provides baseline results in multiple ASR and ST settings.

02

Uses multilingual models to improve translation performance for low-resource language pairs.

03

Offers a scalable methodology for corpus creation.

Abstract

We present the Multilingual TEDx corpus, built to support speech recognition (ASR) and speech translation (ST) research across many non-English source languages. The corpus is a collection of audio recordings from TEDx talks in 8 source languages. We segment transcripts into sentences and align them to the source-language audio and target-language translations. The corpus is released along with open-sourced code enabling extension to new talks and languages as they become available. Our corpus creation methodology can be applied to more languages than previous work, and creates multi-way parallel evaluation sets. We provide baselines in multiple ASR and ST settings, including multilingual models to improve translation performance for low-resource language pairs.
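
To make "multi-way parallel evaluation sets" concrete, here is a minimal Python sketch. It assumes a hypothetical in-memory `Segment` record (talk ID, audio time span, transcript, and a dictionary of translations) and a hypothetical helper `multiway_parallel`; neither is the corpus's actual file format or released code. The idea shown is only that a segment belongs to a multi-way parallel set when it has a translation in every requested target language.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """One sentence-level segment of a talk (hypothetical schema for illustration)."""
    talk_id: str
    start: float       # segment start within the talk audio, in seconds
    end: float         # segment end within the talk audio, in seconds
    transcript: str    # source-language sentence
    translations: dict = field(default_factory=dict)  # target language code -> sentence


def multiway_parallel(segments, target_langs):
    """Keep only segments that have a translation in every requested target language."""
    wanted = set(target_langs)
    return [seg for seg in segments if wanted.issubset(seg.translations)]


# Made-up example data: two Spanish segments, only the first has both translations.
segments = [
    Segment("talk_a", 0.0, 3.2, "Hola a todos.",
            {"en": "Hello everyone.", "pt": "Olá a todos."}),
    Segment("talk_a", 3.2, 6.0, "Gracias por venir.",
            {"en": "Thanks for coming."}),
]

eval_set = multiway_parallel(segments, ["en", "pt"])
print([seg.transcript for seg in eval_set])
```

Filtering on the intersection of available target languages trades evaluation-set size for comparability: every system is scored on the same source sentences, whichever target language it translates into.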

Code & Models

Datasets

dominguesm/mTEDx-ptbr
dataset · 388 downloads
