Visually Grounded Speech Models for Low-resource Languages and Cognitive Modelling
Leanne Nortje

TL;DR
This dissertation explores visually grounded speech models that learn from unlabelled speech and images, focusing on low-resource languages and cognitive aspects like language acquisition and mutual exclusivity bias.
Contribution
It introduces visually prompted keyword localisation, a new task in which an image is used to detect and localise a keyword in speech, and demonstrates the effectiveness of VGS models in low-resource language learning and in the analysis of cognitive biases.
Findings
VGS models perform well in few-shot learning for Yoruba.
Monolingual VGS models exhibit mutual exclusivity bias.
Multilingualism does not influence the bias in VGS models.
Abstract
This dissertation examines visually grounded speech (VGS) models that learn from unlabelled speech paired with images. It focuses on applications for low-resource languages and understanding human language acquisition. We introduce a task called visually prompted keyword localisation to detect and localise keywords in speech using images. We demonstrate the effectiveness of VGS models in few-shot learning scenarios for low-resource languages like Yoruba. Additionally, we examine the mutual exclusivity bias in VGS models. Our monolingual VGS model exhibits this bias, but we found that multilingualism does not affect the bias in this VGS model in the way that has been observed in children.
Taxonomy
Topics: Speech and dialogue systems
