Europarl-ST: A Multilingual Corpus For Speech Translation Of Parliamentary Debates

Javier Iranzo-Sánchez; Joan Albert Silvestre-Cerdà; Javier Jorge; Nahuel Roselló; Adrià Giménez; Albert Sanchis; Jorge Civera; Alfons Juan

arXiv:1911.03167 · cs.CL · February 13, 2020 · 19 cites

TL;DR

Europarl-ST is a multilingual speech translation corpus built from European Parliament debates held between 2008 and 2012. It provides paired audio-text samples for translation from and into 6 European languages, covering 30 translation directions.

Contribution

The paper introduces Europarl-ST, a freely accessible multilingual speech translation corpus released under a Creative Commons license. It addresses the shortage of SLT data, since existing SLT datasets cover only a limited set of language pairs.

Findings

01 Demonstrated the corpus's utility through automatic speech recognition experiments

02 Reported machine translation performance on the new dataset

03 Presented spoken language translation experiments that highlight the potential of the corpus

Abstract

Current research into spoken language translation (SLT), or speech-to-text translation, is often hampered by the lack of specific data resources for this task, as currently available SLT datasets are restricted to a limited set of language pairs. In this paper we present Europarl-ST, a novel multilingual SLT corpus containing paired audio-text samples for SLT from and into 6 European languages, for a total of 30 different translation directions. This corpus has been compiled using the debates held in the European Parliament in the period between 2008 and 2012. This paper describes the corpus creation process and presents a series of automatic speech recognition, machine translation and spoken language translation experiments that highlight the potential of this new resource. The corpus is released under a Creative Commons license and is freely accessible and downloadable.
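The figure of 30 translation directions follows from counting ordered (source, target) pairs over 6 languages: 6 × 5 = 30. The short sketch below only illustrates that arithmetic. The labels `L1` to `L6` are placeholders, since the abstract does not name the six languages, and nothing here reflects the corpus's actual file layout.

```python
from itertools import permutations

# Placeholder labels (assumption): the abstract does not name the six languages.
languages = ["L1", "L2", "L3", "L4", "L5", "L6"]

# A translation direction is an ordered (source, target) pair of distinct languages.
directions = list(permutations(languages, 2))

# Number of ordered pairs without repetition: n * (n - 1).
n = len(languages)
print(len(directions), n * (n - 1))
```

Because direction matters in translation, (L1, L2) and (L2, L1) count separately, which is why the total is 30 rather than the 15 unordered pairs.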

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis