Can phones, syllables, and words emerge as side-products of cross-situational audiovisual learning? -- A computational investigation

Khazar Khorrami; Okko Räsänen

arXiv:2109.14200 · eess.AS · March 12, 2024 · 5 cites

TL;DR

This study investigates whether basic linguistic units like phones, syllables, and words can emerge naturally as side-products of audiovisual cross-situational learning in neural networks, without explicit targeting of these units.

Contribution

It introduces the latent language hypothesis (LLH) and evaluates its validity through extensive simulations with neural networks on synthetic and real speech data.

Findings

01. Latent representations can reflect phonetic, syllabic, or lexical structures.

02. Audiovisual learning supports emergence of linguistic units.

03. Neural networks develop meaningful linguistic features without explicit supervision.

Abstract

Decades of research have studied how language-learning infants learn to discriminate speech sounds, segment words, and associate words with their meanings. While gradual development of such capabilities is unquestionable, the exact nature of these skills and the underlying mental representations remain unclear. In parallel, computational studies have shown that basic comprehension of speech can be achieved by statistical learning between speech and concurrent referentially ambiguous visual input. These models can operate without prior linguistic knowledge such as representations of linguistic units, and without learning mechanisms specifically targeted at such units. This has raised the question of to what extent knowledge of linguistic units, such as phone(me)s, syllables, and words, could actually emerge as latent representations supporting the translation between speech and…

Code & Models

Repositories

speechcog/vgs_xsl (Official)

Taxonomy

Topics: Multisensory perception and integration · Music and Audio Processing · Animal Vocal Communication and Behavior