YFACC: A Yorùbá speech-image dataset for cross-lingual keyword localisation through visual grounding

Kayode Olaleye; Dan Oneata; Herman Kamper

arXiv:2210.04600·cs.CL·October 13, 2022

TL;DR

This paper introduces YFACC, a new Yorùbá speech-image dataset that enables cross-lingual keyword localisation in a low-resource language using visually grounded speech models. Such models could support the documentation of unwritten languages.

Contribution

The paper presents a new Yorùbá speech-image dataset and a cross-lingual visually grounded speech (VGS) model, addressing the scarcity of VGS work on low-resource languages.

Findings

01. Collection and release of a single-speaker Yorùbá speech-image dataset with spoken captions for 6,000 Flickr images.

02. Demonstration of cross-lingual keyword localisation: a written English query is detected and located in Yorùbá speech.

03. Comparison with English systems to quantify the effect of the smaller dataset on model performance.

Abstract

Visually grounded speech (VGS) models are trained on images paired with unlabelled spoken captions. Such models could be used to build speech systems in settings where it is impossible to get labelled data, e.g. for documenting unwritten languages. However, most VGS studies are in English or other high-resource languages. This paper attempts to address this shortcoming. We collect and release a new single-speaker dataset of audio captions for 6k Flickr images in Yorùbá, a real low-resource language spoken in Nigeria. We train an attention-based VGS model where images are automatically tagged with English visual labels and paired with Yorùbá utterances. This enables cross-lingual keyword localisation: a written English query is detected and located in Yorùbá speech. To quantify the effect of the smaller dataset, we compare to English systems trained on similar and more…
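To make the idea of attention-based keyword localisation concrete, here is a minimal sketch, not the authors' model. It assumes that each speech frame and each English keyword have already been mapped to vectors in a shared space (in a real system these would come from trained encoders). The function name `localise_keyword`, the dot-product scoring, the sigmoid detection score, and the toy feature values are all illustrative assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def localise_keyword(frames, query, threshold=0.5):
    """Hypothetical attention-based detection and localisation.

    frames: list of per-frame speech feature vectors (assumed given).
    query:  embedding of the written keyword (assumed given).
    Returns (detection score, predicted frame index or None, attention weights).
    """
    # Attention: score each frame against the keyword, normalise over time.
    weights = softmax([dot(f, query) for f in frames])
    # Context vector: attention-weighted average of the frame features.
    context = [
        sum(w * f[d] for w, f in zip(weights, frames))
        for d in range(len(query))
    ]
    # Detection: squash the context-query similarity into (0, 1).
    detection = 1.0 / (1.0 + math.exp(-dot(context, query)))
    # Localisation: the frame with the largest attention weight,
    # reported only when the keyword is judged to be present.
    peak_frame = max(range(len(frames)), key=lambda t: weights[t])
    location = peak_frame if detection >= threshold else None
    return detection, location, weights


# Toy example: four frames of 2-dimensional features (made-up values).
frames = [[0.1, 0.0], [0.0, 0.2], [2.0, 1.5], [0.1, 0.1]]
score, location, weights = localise_keyword(frames, query=[1.0, 1.0])
print(f"detection={score:.2f}, located at frame {location}")
```

In this sketch the same attention weights serve two purposes: they pool the frames for the detection decision and, through their maximum, indicate where in the utterance the keyword most likely occurs.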

Taxonomy

Topics: Multimodal Machine Learning Applications · Subtitles and Audiovisual Media · Video Analysis and Summarization