Assessing the alignment between infants' visual and linguistic experience using multimodal language models

Alvin Wei Ming Tan; Jane Yang; Tarun Sepuri; Khai Loong Aw; Robert Z. Sparks; Zi Yin; Virginia A. Marchman; Michael C. Frank; Bria Long

arXiv:2511.18824·cs.CV·November 25, 2025

TL;DR

This study uses CLIP models to automatically measure how well infants' visual and linguistic experiences align in everyday home environments, and finds that aligned moments, which most models of early word learning rely on, are infrequent.

Contribution

The paper introduces an automated, CLIP-based method for assessing vision-language alignment in egocentric infant videos, removing the dependence on labor-intensive manual annotation that limited earlier work on early language acquisition.

Findings

01. Aligned moments are rare in infants' natural environments.

02. Variability in alignment exists both within and across children.

03. The method offers a new way to study multimodal learning environments.

Abstract

Figuring out which objects or concepts words refer to is a central language learning challenge for young children. Most models of this process posit that children learn early object labels from co-occurrences of words and their referents that occur when someone around them talks about an object in the immediate physical environment. But how aligned in time are children's visual and linguistic experiences during everyday learning? To date, answers to this question have been limited by the need for labor-intensive manual annotations of vision-language co-occurrences. Here, we evaluate the use of contrastive language-image pretraining (CLIP) models to automatically characterize vision-language alignment in egocentric videos taken from the infant perspective in home environments. After validating CLIP alignment scores using human alignment judgments, we apply this metric to a large corpus…
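As a rough illustration of what a CLIP-style alignment score is, the sketch below computes the cosine similarity between an image embedding and a text embedding for each frame-utterance pair, then reports the share of pairs above a cutoff. It assumes the embeddings have already been produced by a CLIP image encoder and text encoder; the function names, the index-based pairing of frames with utterances, and the threshold are all hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimensionality")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_u * norm_v)


def alignment_scores(
    frame_embeddings: Sequence[Sequence[float]],
    utterance_embeddings: Sequence[Sequence[float]],
) -> List[float]:
    """Score each frame against the utterance paired with it by index.

    Assumption: frame i and utterance i co-occur in time. How frames and
    utterances are actually paired is a design choice not specified here.
    """
    if len(frame_embeddings) != len(utterance_embeddings):
        raise ValueError("need exactly one utterance embedding per frame")
    return [
        cosine_similarity(frame, utterance)
        for frame, utterance in zip(frame_embeddings, utterance_embeddings)
    ]


def fraction_aligned(scores: Sequence[float], threshold: float) -> float:
    """Share of pairs whose score meets a (hypothetical) alignment threshold."""
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(score >= threshold for score in scores) / len(scores)
```

With real data, the two embedding lists would come from a pretrained CLIP model, and any cutoff would need to be calibrated against human alignment judgments rather than chosen by hand.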

Taxonomy

Topics: Language Development and Disorders · Child and Animal Learning Development · Categorization, perception, and language