Visually Grounded Keyword Detection and Localisation for Low-Resource Languages

Kayode Kolawole Olaleye

arXiv:2302.00765·cs.CL·February 3, 2023

TL;DR

This paper explores the use of Visually Grounded Speech models for keyword localisation in low-resource languages, demonstrating promising results and introducing a new Yoruba dataset for cross-lingual evaluation.

Contribution

It proposes four methods for keyword localisation with VGS models, evaluates them on English and Yoruba, and provides insights into cross-lingual transfer and challenges in low-resource settings.

Findings

01

Best method achieves 57% accuracy on English data

02

Cross-lingual localisation precision reaches 16% in Yoruba

03

Pretraining on English improves Yoruba localisation performance

Abstract

This study investigates the use of Visually Grounded Speech (VGS) models for keyword localisation in speech. The study focusses on two main research questions: (1) Is keyword localisation possible with VGS models? (2) Can keyword localisation be done cross-lingually in a real low-resource setting? Four methods for localisation are proposed and evaluated on an English dataset, with the best-performing method achieving an accuracy of 57%. A new dataset containing spoken captions in the Yoruba language is also collected and released for cross-lingual keyword localisation. The cross-lingual model obtains a precision of 16% in actual keyword localisation, and this performance can be improved by initialising from a model pretrained on English data. The study presents a detailed analysis of the model's success and failure modes and highlights the challenges of using VGS models for keyword…

Taxonomy

Topics: Text and Document Classification Technologies · Video Analysis and Summarization · Speech and Audio Processing