YFACC: A Yorùbá speech-image dataset for cross-lingual keyword localisation through visual grounding
Kayode Olaleye, Dan Oneata, Herman Kamper

TL;DR
This paper introduces YFACC, a new Yorùbá speech-image dataset, enabling cross-lingual keyword localization in low-resource languages using visually grounded speech models, and demonstrates its potential for supporting unwritten language documentation.
Contribution
The paper presents a novel Yorùbá speech-image dataset and a cross-lingual VGS model, addressing the lack of resources for low-resource languages in visual grounding research.
Findings
Collection and release of a single-speaker Yorùbá dataset of spoken captions for 6,000 Flickr images.
Demonstration of cross-lingual keyword localization from English queries to Yorùbá speech.
Comparison against English systems to quantify the effect of the smaller dataset on model performance.
Abstract
Visually grounded speech (VGS) models are trained on images paired with unlabelled spoken captions. Such models could be used to build speech systems in settings where it is impossible to get labelled data, e.g. for documenting unwritten languages. However, most VGS studies are in English or other high-resource languages. This paper attempts to address this shortcoming. We collect and release a new single-speaker dataset of audio captions for 6k Flickr images in Yorùbá, a real low-resource language spoken in Nigeria. We train an attention-based VGS model where images are automatically tagged with English visual labels and paired with Yorùbá utterances. This enables cross-lingual keyword localisation: a written English query is detected and located in Yorùbá speech. To quantify the effect of the smaller dataset, we compare to English systems trained on similar and more…
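The idea of attention-based keyword localisation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' architecture: the function name `localise_keyword`, the toy two-dimensional embeddings, and the 0.5 threshold are all assumptions made for illustration. The sketch scores each speech frame against a keyword embedding, uses the softmax attention weights to pool a detection score, and reports the highest-weighted frame as the predicted location.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def localise_keyword(frames, query, threshold=0.5):
    """Hypothetical helper, not from the paper.

    frames: per-frame speech feature vectors (toy stand-ins for learned
            embeddings of a Yorùbá utterance).
    query:  embedding of the written keyword (a toy stand-in for an
            English query embedding).
    """
    # Similarity between every frame and the keyword.
    scores = [dot(f, query) for f in frames]
    # Attention weights over frames.
    weights = softmax(scores)
    # Attention-pooled score, squashed to a detection probability.
    pooled = sum(w * s for w, s in zip(weights, scores))
    prob = 1.0 / (1.0 + math.exp(-pooled))
    detected = prob >= threshold
    # Localise only if the keyword is detected: take the peak attention frame.
    location = None
    if detected:
        location = max(range(len(weights)), key=weights.__getitem__)
    return detected, prob, location

# Toy utterance of four frames; the third frame matches the query best.
frames = [[0.1, 0.0], [0.2, 0.1], [2.0, 1.5], [0.0, 0.3]]
query = [1.0, 1.0]
detected, prob, location = localise_keyword(frames, query)
print(detected, location)
```

In a trained system the frame and query embeddings would come from learned speech and keyword encoders; here they are hand-written numbers so that the detection and localisation steps can be followed directly.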
Taxonomy
Topics: Multimodal Machine Learning Applications · Subtitles and Audiovisual Media · Video Analysis and Summarization
