Visually Grounded Speech Models for Low-resource Languages and Cognitive Modelling

Leanne Nortje

arXiv:2409.02865 · cs.CL · September 5, 2024

TL;DR

This dissertation explores visually grounded speech models that learn from unlabelled speech and images, focusing on low-resource languages and cognitive aspects like language acquisition and mutual exclusivity bias.

Contribution

It introduces visually prompted keyword localisation, a new task that uses images to detect and localise keywords in speech, and demonstrates that VGS models are effective for few-shot learning in a low-resource language and for analysing the mutual exclusivity bias.

Findings

01

VGS models perform well in few-shot learning for Yoruba.

02

Monolingual VGS models exhibit mutual exclusivity bias.

03

Multilingualism does not influence the bias in VGS models.

Abstract

This dissertation examines visually grounded speech (VGS) models, which learn from unlabelled speech paired with images. It focuses on two areas: applications to low-resource languages and the modelling of human language acquisition. We introduce a task called visually prompted keyword localisation, in which images are used to detect and localise keywords in speech. We demonstrate the effectiveness of VGS models in few-shot learning scenarios for low-resource languages such as Yoruba. Additionally, we examine the mutual exclusivity bias in VGS models. Our monolingual VGS model exhibits this bias, but we find that multilingualism does not affect the bias in this VGS model in the way that has been observed in children.
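To make the input and output of visually prompted keyword localisation concrete, the following is a minimal illustrative sketch, not the dissertation's method. It assumes, hypothetically, that an image query and each speech frame have already been mapped into a shared embedding space; the function names, the cosine-similarity scoring, and the threshold value are all assumptions introduced for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def detect_and_localise(image_embedding, frame_embeddings, threshold=0.5):
    """Hypothetical scoring rule: compare the image query with every speech frame.

    Detection: does the best-matching frame reach the (assumed) threshold?
    Localisation: the index of that best-matching frame, or None if not detected.
    Returns (detected, frame_index, best_score).
    """
    if not frame_embeddings:
        return False, None, 0.0
    scores = [cosine(image_embedding, frame) for frame in frame_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    detected = scores[best] >= threshold
    return detected, (best if detected else None), scores[best]


# Toy data: a 2-dimensional image query and three speech-frame embeddings.
query = [1.0, 0.0]
frames = [[0.0, 1.0], [0.9, 0.1], [0.2, 0.8]]
result = detect_and_localise(query, frames)
```

In this toy example the second frame is the closest match to the image query, so the keyword is reported as detected and localised at frame index 1. A real VGS model would learn the embeddings from paired speech and images rather than take them as given.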

Taxonomy

Topics: Speech and dialogue systems