Modelling word learning and recognition using visually grounded speech
Danny Merkx, Sebastiaan Scholten, Stefan L. Frank, Mirjam Ernestus, and Odette Scharenborg

TL;DR
This paper explores a visually grounded speech model that learns to recognize words from visual and spoken input without prior knowledge, demonstrating human-like word recognition effects and analyzing the impact of vector quantisation.
Contribution
It investigates the time-course of word recognition in a grounded speech model and assesses the role of vector quantisation in word discovery and recognition.
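One way to picture a time-course analysis of this kind is a gating procedure, in which a word is presented in successively longer fragments and recognition is checked after each one. The sketch below is a toy illustration only: it replaces the speech model with a hypothetical symbolic lexicon (`LEXICON`), and the helper names `gates`, `cohort`, and `recognition_gate` are assumptions introduced here, not the authors' code.

```python
from typing import Dict, List, Optional, Sequence

# Hypothetical toy lexicon; the phoneme-like symbols are illustrative only.
LEXICON: Dict[str, List[str]] = {
    "dog": ["d", "o", "g"],
    "dogs": ["d", "o", "g", "z"],
    "doll": ["d", "o", "l"],
    "cat": ["k", "a", "t"],
}


def gates(segments: Sequence[str]) -> List[List[str]]:
    """Return every prefix of the word, one segment longer at each gate."""
    return [list(segments[:n]) for n in range(1, len(segments) + 1)]


def cohort(prefix: Sequence[str]) -> List[str]:
    """Words in the toy lexicon that are still consistent with the prefix."""
    return sorted(
        word
        for word, segs in LEXICON.items()
        if segs[: len(prefix)] == list(prefix)
    )


def recognition_gate(target: str) -> Optional[int]:
    """First gate (1-based) at which the target is the only candidate left.

    Returns None if a competitor is still active at the final gate.
    """
    for n, prefix in enumerate(gates(LEXICON[target]), start=1):
        if cohort(prefix) == [target]:
            return n
    return None


if __name__ == "__main__":
    for word in LEXICON:
        print(word, recognition_gate(word))
```

In this toy lexicon `doll` only becomes unique at its third segment because `dog` and `dogs` compete with it until then, while `dog` never becomes unique because `dogs` shares all of its segments.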
Findings
The model recognises nouns and differentiates singular from plural forms.
Recognition is influenced by word-competition effects similar to those observed in humans.
Vector quantisation does not improve word recognition and may require more input.
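In general terms, vector quantisation replaces each continuous vector with the nearest entry of a finite codebook, so a sequence of frames becomes a sequence of discrete indices. The following is a minimal sketch of that lookup step under stated assumptions: the codebook values, the frame values, and the function names `squared_distance`, `quantise`, and `quantise_sequence` are hypothetical, a trained model would learn its codebook rather than fix it by hand, and nothing here reproduces the paper's architecture.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantise(
    vector: Sequence[float], codebook: Sequence[Sequence[float]]
) -> Tuple[int, List[float]]:
    """Return the index and value of the codebook entry nearest to `vector`."""
    index = min(
        range(len(codebook)),
        key=lambda i: squared_distance(vector, codebook[i]),
    )
    return index, list(codebook[index])


def quantise_sequence(
    frames: Sequence[Sequence[float]], codebook: Sequence[Sequence[float]]
) -> List[int]:
    """Map each frame to the index of its nearest codebook entry."""
    return [quantise(frame, codebook)[0] for frame in frames]


# Hypothetical hand-picked codebook and frames, for illustration only.
CODEBOOK = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
FRAMES = [[0.1, -0.2], [0.9, 1.2], [0.2, 1.9], [0.8, 0.7]]

if __name__ == "__main__":
    print(quantise_sequence(FRAMES, CODEBOOK))
```

The output is one discrete code index per input frame, which is the kind of unit-like representation that makes quantisation a candidate aid for word discovery.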
Abstract
Background: Computational models of speech recognition often assume that the set of target words is already given. This implies that these models do not learn to recognise speech from scratch without prior knowledge and explicit supervision. Visually grounded speech models learn to recognise speech without prior knowledge by exploiting statistical dependencies between spoken and visual input. While it has previously been shown that visually grounded speech models learn to recognise the presence of words in the input, we explicitly investigate such a model as a model of human speech recognition. Methods: We investigate the time-course of word recognition as simulated by the model using a gating paradigm to test whether its recognition is affected by well-known word-competition effects in human speech processing. We furthermore investigate whether vector quantisation, a technique for…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Speech and Audio Processing
