End-to-End Spoken Language Translation
Michelle Guo, Albert Haque, Prateek Verma

TL;DR
This paper introduces an end-to-end model for spoken language translation that directly converts speech in one language to speech in another, trained from scratch and capable of generalizing to unseen speakers.
Contribution
The proposed model combines a pyramidal-bidirectional recurrent network with a convolutional network to translate speech directly into speech at the spectrogram level. It can be trained from scratch and generalizes to unseen speakers.
Findings
Achieves performance competitive with state-of-the-art methods on multiple languages
Can be trained completely from scratch
Generalizes well to unseen speakers
Abstract
In this paper, we address the task of spoken language understanding. We present a method for translating spoken sentences from one language into spoken sentences in another language. Given spectrogram-spectrogram pairs, our model can be trained completely from scratch to translate unseen sentences. Our method consists of a pyramidal-bidirectional recurrent network combined with a convolutional network to output sentence-level spectrograms in the target language. Empirically, our model achieves competitive performance with state-of-the-art methods on multiple languages and can generalize to unseen speakers.
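The "pyramidal" part of the encoder can be illustrated with a small sketch. The abstract does not specify how the time axis is reduced, so the code below assumes the common scheme in which each pyramid level concatenates adjacent frames, halving the number of time steps. The names `pyramid_reduce` and `pyramid_shapes` are hypothetical, the recurrent and convolutional layers are omitted entirely, and the zero-padding of odd-length inputs is an illustrative choice rather than the authors' method.

```python
from typing import List, Tuple

Frame = List[float]  # one spectrogram frame: a list of frequency-bin values


def pyramid_reduce(frames: List[Frame]) -> List[Frame]:
    """Halve the time axis by concatenating each pair of adjacent frames.

    Assumption: this mirrors the frame-concatenation step commonly used in
    pyramidal recurrent encoders; the paper's exact scheme is not given
    in the abstract.
    """
    if len(frames) % 2 == 1:
        # Pad with one zero frame so every frame has a partner (assumed policy).
        frames = frames + [[0.0] * len(frames[0])]
    return [frames[t] + frames[t + 1] for t in range(0, len(frames), 2)]


def pyramid_shapes(num_frames: int, num_bins: int, num_layers: int) -> List[Tuple[int, int]]:
    """Return the (time steps, feature width) seen at each pyramid level.

    Expects num_frames >= 1. No recurrent layer sits between levels here,
    so the feature width doubles each time; a real encoder would project
    it back to a fixed hidden size.
    """
    frames = [[0.0] * num_bins for _ in range(num_frames)]
    shapes = [(len(frames), num_bins)]
    for _ in range(num_layers):
        frames = pyramid_reduce(frames)
        shapes.append((len(frames), len(frames[0])))
    return shapes


if __name__ == "__main__":
    # An 8-frame, 4-bin toy spectrogram passed through two pyramid levels.
    print(pyramid_shapes(8, 4, 2))
```

Shortening the sequence in this way reduces the number of time steps that later layers must process, which is the usual motivation for a pyramidal encoder on long speech inputs.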
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
