Translating speech with just images

Dan Oneata; Herman Kamper

arXiv:2406.07133·eess.AS·June 12, 2024

TL;DR

This paper presents a novel method for speech translation using images as an intermediary, enabling translation in low-resource languages by linking speech to text through visual grounding and captioning.

Contribution

The paper introduces a system that translates speech in a low-resource language into text in another language by leveraging an existing image captioning system and pretrained components, with a focus on Yorùbá-to-English translation.

Findings

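01 Predicted translations capture the main semantics of the spoken audio, albeit in a simpler and shorter form

02 Training on diverse image captions limits overfitting

03 Pretrained components make learning possible in the low-resource regime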

Abstract

Visually grounded speech models link speech to images. We extend this connection by linking images to text via an existing image captioning system, and as a result gain the ability to map speech audio directly to text. This approach can be used for speech translation with just images by having the audio in a different language from the generated captions. We investigate such a system on a real low-resource language, Yorùbá, and propose a Yorùbá-to-English speech translation model that leverages pretrained components in order to be able to learn in the low-resource regime. To limit overfitting, we find that it is essential to use a decoding scheme that produces diverse image captions for training. Results show that the predicted translations capture the main semantics of the spoken audio, albeit in a simpler and shorter form.
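The data construction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the captioner is a hypothetical lookup table of candidate captions with made-up probabilities, and the names `TOY_CAPTIONER`, `greedy_caption`, `sample_captions`, and `build_pairs` are invented for illustration. It only shows how spoken utterances paired with images can be turned into (audio, English text) training pairs, and how sampling captions gives more varied targets than always taking the single most likely caption.

```python
import random

# Hypothetical stand-in for a pretrained image captioner: each image maps to
# candidate English captions with made-up probabilities.
TOY_CAPTIONER = {
    "img_001": [
        ("a dog runs on the beach", 0.5),
        ("a brown dog playing in sand", 0.3),
        ("a dog near the sea", 0.2),
    ],
    "img_002": [
        ("two children ride bicycles", 0.6),
        ("kids cycling down a street", 0.4),
    ],
}


def greedy_caption(image_id):
    """Return the single most likely caption (no diversity)."""
    return max(TOY_CAPTIONER[image_id], key=lambda c: c[1])[0]


def sample_captions(image_id, n, rng):
    """Draw n captions at random, weighted by probability (diverse targets)."""
    captions, weights = zip(*TOY_CAPTIONER[image_id])
    return rng.choices(captions, weights=weights, k=n)


def build_pairs(spoken, captions_per_image, rng=None):
    """Turn (audio_id, image_id) pairs into (audio_id, caption) training pairs.

    With rng=None the greedy caption is repeated; with an rng, captions are
    sampled so that one utterance can receive several different targets.
    """
    pairs = []
    for audio_id, image_id in spoken:
        if rng is None:
            targets = [greedy_caption(image_id)] * captions_per_image
        else:
            targets = sample_captions(image_id, captions_per_image, rng)
        pairs.extend((audio_id, target) for target in targets)
    return pairs


# Utterance IDs are placeholders for spoken descriptions of each image.
spoken = [("utt_a", "img_001"), ("utt_b", "img_002")]
greedy_pairs = build_pairs(spoken, captions_per_image=3)
sampled_pairs = build_pairs(spoken, captions_per_image=3, rng=random.Random(0))
```

In this sketch the greedy pairs always repeat one caption per utterance, while the sampled pairs may cover several phrasings of the same image. A speech-to-text model trained on such pairs never sees a parallel translation, only captions generated from the shared image.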

Code & Models

Repositories

danoneata/strim (PyTorch, official)

Taxonomy

Topics: Translation Studies and Practices