Listen and Translate: A Proof of Concept for End-to-End Speech-to-Text Translation
Alexandre Berard, Olivier Pietquin, Christophe Servan, and Laurent Besacier

TL;DR
This paper introduces an end-to-end speech-to-text translation model that bypasses source-language transcription. The approach is demonstrated on a small synthetic French-English corpus and could simplify data collection, especially for under-resourced languages.
Contribution
It presents the first end-to-end speech-to-text translation model that does not rely on source language transcription during training or decoding.
Findings
Promising results on a small French-English synthetic corpus.
Potential to simplify data collection for unwritten languages.
Demonstrates feasibility of direct speech-to-text translation.
Abstract
This paper proposes a first attempt to build an end-to-end speech-to-text translation system, which does not use source language transcription during learning or decoding. We propose a model for direct speech-to-text translation, which gives promising results on a small French-English synthetic corpus. Relaxing the need for source language transcription would drastically change the data collection methodology in speech translation, especially in under-resourced scenarios. For instance, in the former project DARPA TRANSTAC (speech translation from spoken Arabic dialects), a large effort was devoted to the collection of speech transcripts (and a prerequisite to obtain transcripts was often a detailed transcription guide for languages with little standardized spelling). Now, if end-to-end approaches for speech-to-text translation are successful, one might consider collecting data by asking…
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Speech and dialogue systems
