Visually grounded models of spoken language: A survey of datasets, architectures and evaluation techniques
Grzegorz Chrupała

TL;DR
This survey reviews two decades of visually grounded spoken language models, highlighting datasets, architectures, and evaluation methods across multiple disciplines to inform future research.
Contribution
It provides a comprehensive overview of datasets, models, and evaluation techniques in visually grounded spoken language research, integrating insights from multiple fields.
Findings
Extensive datasets have enabled progress in the field.
Various modeling architectures have been developed and compared.
Evaluation metrics and analysis techniques are well-established.
Abstract
This survey provides an overview of the evolution of visually grounded models of spoken language over the last 20 years. Such models are inspired by the observation that when children pick up a language, they rely on a wide range of indirect and noisy clues, crucially including signals from the visual modality co-occurring with spoken utterances. Several fields have made important contributions to this approach to modeling or mimicking the process of learning language: Machine Learning, Natural Language and Speech Processing, Computer Vision and Cognitive Science. The current paper brings together these contributions in order to provide a useful introduction and overview for practitioners in all these areas. We discuss the central research questions addressed, the timeline of developments, and the datasets which enabled much of this work. We then summarize the main modeling…
