A model of early word acquisition based on realistic-scale audiovisual naming events
Khazar Khorrami, Okko Räsänen

TL;DR
This study simulates word learning in infants up to 12 months of age with a model that learns solely from statistical regularities in unannotated audiovisual input, and finds that it acquires words at a rate comparable to that of infants.
Contribution
The paper introduces an audiovisual model that learns word-object associations from unannotated raw speech and pixel-level visual input, with the quantity of object naming events matched to that accessible to infants of comparable ages.
Findings
The model learns to recognize words and associate them with the corresponding visual objects
Its vocabulary growth rate is comparable to that observed in infants
The results support the viability of statistical learning as a mechanism for early word perception
Abstract
Infants gradually learn to parse continuous speech into words and connect names with objects, yet the mechanisms behind development of early word perception skills remain unknown. We studied the extent to which early words can be acquired through statistical learning from regularities in audiovisual sensory input. We simulated word learning in infants up to 12 months of age in a realistic setting, using a model that solely learns from statistical regularities in unannotated raw speech and pixel-level visual input. Crucially, the quantity of object naming events was carefully designed to match that accessible to infants of comparable ages. Results show that the model effectively learns to recognize words and associate them with corresponding visual objects, with a vocabulary growth rate comparable to that observed in infants. The findings support the viability of general statistical…
Taxonomy
Topics: Subtitles and Audiovisual Media · Speech and dialogue systems
