Low-Resource Speech-to-Text Translation
Sameer Bansal, Herman Kamper, Karen Livescu, Adam Lopez, Sharon Goldwater

TL;DR
This paper explores direct speech-to-text translation for low-resource languages using neural models, focusing on efficiency and performance with limited data and computational resources.
Contribution
It demonstrates that switching from character-level to word-level decoding speeds up training enough to make direct speech-to-text translation practical when both data and computation are limited.
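One intuition for why word-level decoding is cheaper is that the decoder emits far fewer output tokens per sentence. The sketch below only counts output steps for a single example sentence; the function name `decoder_steps`, the whitespace tokenization, and the extra end-of-sequence step are illustrative assumptions, not details taken from the paper.

```python
def decoder_steps(sentence: str) -> dict:
    """Count the output tokens a decoder would emit for one target sentence.

    Assumes one extra step for an end-of-sequence token (an illustrative
    convention, not something stated in the paper).
    """
    char_tokens = list(sentence)      # every character, including spaces
    word_tokens = sentence.split()    # whitespace-separated words
    return {"char": len(char_tokens) + 1, "word": len(word_tokens) + 1}


steps = decoder_steps("where are you from")
print(steps)  # {'char': 19, 'word': 5}
```

Fewer decoding steps mean shorter output sequences per training example, which is consistent with the speed improvement described here. The count says nothing about translation quality.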
Findings
Models trained on 50 hours of data predict words with ~50% precision and recall.
Models trained on the full 160 hours predict words with ~60% precision and recall.
Direct translation can be effective even with limited data.
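To make the precision and recall figures concrete, here is a minimal sketch of one common way to score predicted words against a reference translation, using clipped bag-of-words counts. The paper's exact metric definition is not given in this summary, so the function `word_precision_recall`, the lowercasing, and the whitespace tokenization are all assumptions for illustration.

```python
from collections import Counter


def word_precision_recall(hypothesis: str, reference: str) -> tuple[float, float]:
    """Bag-of-words precision and recall of a hypothesis against one reference.

    Illustrative scoring only; the paper may define its metric differently.
    """
    hyp = Counter(hypothesis.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((hyp & ref).values())
    precision = overlap / sum(hyp.values()) if hyp else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    return precision, recall


p, r = word_precision_recall("i am from the city", "i come from a big city")
print(p, r)  # 0.6 0.5
```

In this toy example three of the five predicted words appear in the reference (precision 0.6), and three of the six reference words are recovered (recall 0.5).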
Abstract
Speech-to-text translation has many potential applications for low-resource languages, but the typical approach of cascading speech recognition with machine translation is often impossible, since the transcripts needed to train a speech recognizer are usually not available for low-resource languages. Recent work has found that neural encoder-decoder models can learn to directly translate foreign speech in high-resource scenarios, without the need for intermediate transcription. We investigate whether this approach also works in settings where both data and computation are limited. To make the approach efficient, we make several architectural changes, including a change from character-level to word-level decoding. We find that this choice yields crucial speed improvements that allow us to train with fewer computational resources, yet still performs well on frequent words. We explore…
