Visually Grounded Keyword Detection and Localisation for Low-Resource Languages
Kayode Kolawole Olaleye

TL;DR
This paper explores the use of Visually Grounded Speech models for keyword localisation in low-resource languages, demonstrating promising results and introducing a new Yoruba dataset for cross-lingual evaluation.
Contribution
It proposes four methods for keyword localisation with VGS models, evaluates them on English and Yoruba, and provides insights into cross-lingual transfer and challenges in low-resource settings.
Findings
Best method achieves 57% accuracy on English data
Cross-lingual localisation precision reaches 16% in Yoruba
Pretraining on English improves Yoruba localisation performance
Abstract
This study investigates the use of Visually Grounded Speech (VGS) models for keyword localisation in speech. It focuses on two research questions: (1) is keyword localisation possible with VGS models, and (2) can keyword localisation be done cross-lingually in a real low-resource setting? Four localisation methods are proposed and evaluated on an English dataset, with the best-performing method achieving an accuracy of 57%. A new dataset of spoken captions in Yoruba is also collected and released for cross-lingual keyword localisation. The cross-lingual model obtains a precision of 16% in actual keyword localisation, and this performance can be improved by initialising from a model pretrained on English data. The study presents a detailed analysis of the model's success and failure modes and highlights the challenges of using VGS models for keyword…
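As a rough illustration of what keyword localisation involves, the sketch below assumes a model that emits one score per speech frame for a given keyword. It takes the highest-scoring frame as the predicted location and counts the prediction as correct when that frame falls inside the keyword's ground-truth span. The function names, the 0.5 detection threshold, and the scoring rule are illustrative assumptions; they are not the paper's four methods or its exact metric.

```python
from typing import List, Optional, Tuple


def localise_keyword(frame_scores: List[float], threshold: float = 0.5) -> Optional[int]:
    """Return the index of the highest-scoring frame.

    Returns None when there are no frames or when no frame reaches the
    (assumed) detection threshold, i.e. the keyword is judged absent.
    """
    if not frame_scores:
        return None
    best = max(range(len(frame_scores)), key=lambda i: frame_scores[i])
    return best if frame_scores[best] >= threshold else None


def localisation_precision(
    predictions: List[Optional[int]],
    spans: List[Optional[Tuple[int, int]]],
) -> float:
    """Fraction of predicted locations that fall inside the true keyword span.

    Each span is an inclusive (start_frame, end_frame) pair, or None when the
    keyword does not occur in the utterance. Utterances where no prediction
    was made (None) are excluded from the denominator.
    """
    made = [(p, s) for p, s in zip(predictions, spans) if p is not None]
    if not made:
        return 0.0
    correct = sum(
        1 for p, s in made if s is not None and s[0] <= p <= s[1]
    )
    return correct / len(made)
```

For example, `localise_keyword([0.1, 0.9, 0.3])` selects frame 1, and a prediction at frame 4 scored against a true span of frames 0 to 2 counts as a localisation error.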
Taxonomy
Topics: Text and Document Classification Technologies · Video Analysis and Summarization · Speech and Audio Processing
