Modelling word learning and recognition using visually grounded speech

Danny Merkx; Sebastiaan Scholten; Stefan L. Frank; Mirjam Ernestus; Odette Scharenborg

arXiv:2203.06937·cs.CL·March 15, 2022

TL;DR

This paper explores a visually grounded speech model that learns to recognize words from visual and spoken input without prior knowledge, demonstrating human-like word recognition effects and analyzing the impact of vector quantisation.

Contribution

It investigates the time-course of word recognition in a grounded speech model and assesses the role of vector quantisation in word discovery and recognition.

Findings

01. The model recognises nouns and distinguishes singular from plural forms.

02. Recognition is influenced by word-competition effects similar to those seen in human speech processing.

03. Vector quantisation does not improve word recognition; the vector-quantised model may require more of the input before it recognises a word.

Abstract

Background: Computational models of speech recognition often assume that the set of target words is already given. This implies that these models do not learn to recognise speech from scratch without prior knowledge and explicit supervision. Visually grounded speech models learn to recognise speech without prior knowledge by exploiting statistical dependencies between spoken and visual input. While it has previously been shown that visually grounded speech models learn to recognise the presence of words in the input, we explicitly investigate such a model as a model of human speech recognition.

Methods: We investigate the time-course of word recognition as simulated by the model using a gating paradigm to test whether its recognition is affected by well-known word-competition effects in human speech processing. We furthermore investigate whether vector quantisation, a technique for…
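
As an illustration of the gating paradigm named in the Methods, here is a minimal toy sketch in Python. It is not the authors' implementation: `gated_prefixes`, `recognition_point`, and `make_toy_recogniser` are hypothetical names, letters stand in for speech frames, and the recogniser is a simple unique-prefix lookup rather than a visually grounded speech model. It only shows the general idea of presenting progressively longer portions of a word and noting when the target is first recognised.

```python
from typing import Callable, List, Optional, Sequence


def gated_prefixes(frames: Sequence, step: int = 1) -> List[Sequence]:
    """Return progressively longer prefixes ("gates") of a frame sequence.

    The final gate always contains the complete sequence.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    ends = list(range(step, len(frames), step)) + [len(frames)]
    return [frames[:end] for end in ends]


def recognition_point(
    frames: Sequence,
    recognise: Callable[[Sequence], Optional[str]],
    target: str,
    step: int = 1,
) -> Optional[int]:
    """Length of the first gate at which `recognise` returns the target.

    Returns None if the target is never recognised. This "first correct
    gate" criterion is an assumption made for the sketch.
    """
    for gate in gated_prefixes(frames, step):
        if recognise(gate) == target:
            return len(gate)
    return None


def make_toy_recogniser(lexicon: List[str]) -> Callable[[str], Optional[str]]:
    """Hypothetical stand-in for a model: a word counts as recognised once
    it is the only lexicon entry consistent with the prefix heard so far."""

    def recognise(prefix: str) -> Optional[str]:
        cohort = [word for word in lexicon if word.startswith(prefix)]
        return cohort[0] if len(cohort) == 1 else None

    return recognise


if __name__ == "__main__":
    recognise = make_toy_recogniser(["candle", "candy", "dog"])
    # "candle" competes with "candy" until the fifth letter is heard.
    print(recognition_point("candle", recognise, "candle"))
    # "dog" has no competitors sharing its onset in this toy lexicon.
    print(recognition_point("dog", recognise, "dog"))
```

In this toy lexicon, a word that shares its onset with a competitor needs a longer gate than one that does not, which is the kind of word-initial competition the gating analysis is meant to expose.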

Code & Models

Repositories

DannyMerkx/speech2image (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Speech and Audio Processing