Can phones, syllables, and words emerge as side-products of cross-situational audiovisual learning? -- A computational investigation
Khazar Khorrami, Okko Räsänen

TL;DR
This study investigates whether basic linguistic units like phones, syllables, and words can emerge naturally as side-products of audiovisual cross-situational learning in neural networks, without explicit targeting of these units.
Contribution
It introduces the latent language hypothesis (LLH) and evaluates its validity through extensive simulations with neural networks on synthetic and real speech data.
Findings
Latent representations can reflect phonetic, syllabic, or lexical structures.
Audiovisual learning supports emergence of linguistic units.
Neural networks develop meaningful linguistic features without explicit supervision.
Abstract
Decades of research have studied how language-learning infants learn to discriminate speech sounds, segment words, and associate words with their meanings. While gradual development of such capabilities is unquestionable, the exact nature of these skills and the underlying mental representations remains unclear. In parallel, computational studies have shown that basic comprehension of speech can be achieved by statistical learning between speech and concurrent referentially ambiguous visual input. These models can operate without prior linguistic knowledge such as representations of linguistic units, and without learning mechanisms specifically targeted at such units. This has raised the question of to what extent knowledge of linguistic units, such as phone(me)s, syllables, and words, could actually emerge as latent representations supporting the translation between speech and…
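Illustration
The idea of statistical learning from referentially ambiguous input can be illustrated with a toy sketch. The code below is not the paper's method: the study trains neural networks on speech audio and images, whereas this example assumes the speech is already split into word tokens and the scene into symbolic referent labels. The function names `learn_associations` and `best_referent` and the example data are hypothetical. It only shows how ambiguity within a single situation can be resolved by accumulating co-occurrence counts across situations.

```python
from collections import Counter, defaultdict


def learn_associations(situations):
    """Count how often each word co-occurs with each visible referent.

    Each situation is a (words, referents) pair. Within one situation the
    learner does not know which word goes with which referent.
    """
    cooc = defaultdict(Counter)
    for words, referents in situations:
        for word in words:
            for referent in referents:
                cooc[word][referent] += 1
    return cooc


def best_referent(cooc, word):
    """Return the referent that co-occurred most often with the word."""
    counts = cooc[word]
    return max(counts, key=counts.get)


# Hypothetical toy data: every scene contains two candidate referents,
# so no single situation identifies the meaning of "dog" or "ball".
situations = [
    (["look", "dog"], {"DOG", "BALL"}),
    (["dog", "runs"], {"DOG", "TREE"}),
    (["nice", "ball"], {"BALL", "TREE"}),
    (["ball", "dog"], {"BALL", "DOG"}),
    (["ball", "rolls"], {"BALL", "CUP"}),
]

cooc = learn_associations(situations)
print(best_referent(cooc, "dog"))   # DOG
print(best_referent(cooc, "ball"))  # BALL
```

In this sketch the word units are given in advance. The question the paper asks is whether comparable units can arise inside a model that receives only raw speech and visual input.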
Taxonomy
Topics: Multisensory perception and integration · Music and Audio Processing · Animal Vocal Communication and Behavior
