Translating speech with just images
Dan Oneata, Herman Kamper

TL;DR
This paper translates speech using images as the only intermediary: a visually grounded speech model links audio to images, and an image captioning system links images to text, which makes translation possible for low-resource languages.
Contribution
It introduces a system that translates speech in low-resource languages into text by leveraging image captioning and pretrained models, with a focus on Yorùbá-to-English translation.
Findings
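Predicted translations capture the main semantics of the spoken audio, though in a simpler and shorter form
Training on diverse image captions is essential to limit overfitting
Pretrained components make learning feasible in the low-resource regime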
Abstract
Visually grounded speech models link speech to images. We extend this connection by linking images to text via an existing image captioning system, and as a result gain the ability to map speech audio directly to text. This approach can be used for speech translation with just images by having the audio in a different language from the generated captions. We investigate such a system on a real low-resource language, Yorùbá, and propose a Yorùbá-to-English speech translation model that leverages pretrained components in order to be able to learn in the low-resource regime. To limit overfitting, we find that it is essential to use a decoding scheme that produces diverse image captions for training. Results show that the predicted translations capture the main semantics of the spoken audio, albeit in a simpler and shorter form.
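The data flow described in the abstract can be illustrated with a small sketch: each spoken utterance is tied to an image, the image has a pool of generated captions, and a different caption can serve as the training target each time the utterance is seen. This is not the authors' code. The function name `build_training_pairs`, the identifiers, and the toy captions are all hypothetical, and uniform random sampling from a fixed pool stands in for whatever diverse decoding scheme the paper actually uses.

```python
import random


def build_training_pairs(audio_to_image, image_to_captions, rng):
    """Pair each utterance with one caption sampled from its image's caption pool.

    Hypothetical helper: sampling from a pool of captions is an assumed
    stand-in for the paper's diverse caption decoding.
    """
    pairs = []
    for audio_id, image_id in sorted(audio_to_image.items()):
        captions = image_to_captions[image_id]
        pairs.append((audio_id, rng.choice(captions)))
    return pairs


# Toy data (invented for illustration): utterance ids map to image ids,
# and each image has several English captions.
audio_to_image = {"utt_001": "img_a", "utt_002": "img_b"}
image_to_captions = {
    "img_a": [
        "a dog runs on grass",
        "a brown dog playing outside",
        "a dog in a field",
    ],
    "img_b": [
        "two people ride bikes",
        "cyclists on a road",
    ],
}

rng = random.Random(0)

# Resample targets every epoch so the model does not memorize one caption
# per utterance.
epochs = [
    build_training_pairs(audio_to_image, image_to_captions, rng)
    for _ in range(3)
]
for epoch_index, pairs in enumerate(epochs):
    print(epoch_index, pairs)
```

Resampling the target per epoch is one simple way to expose a model to varied captions for the same audio; the speech and captioning models themselves are left out of this sketch.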
Taxonomy
Topics: Translation Studies and Practices
