EgoBabyVLM: Benchmarking Cross-Modal Learning from Naturalistic Egocentric Video Data

Dongyan Lin; Phillip Rust; Angel Villar Corrales; Alvin W. M. Tan; Mahi Luthra; Charles-Éric Saint-James; Rashel Moritz; Sheila Krogh-Jespersen; Vanessa Stark; Surya Parimi; Jiayi Shen; Youssef Benchekroun; Yosuke Higuchi; Martin Gleize; Tom Fizycki; Nicolas Hamilakis; Manel Khentout; Sho Tsuji; Balázs Kégl; Juan Pino; Michael C. Frank; and Emmanuel Dupoux

arXiv:2605.19130 · cs.LG · May 20, 2026

TL;DR

This paper introduces EgoBabyVLM, a benchmark and challenge for evaluating and advancing vision-language models on naturalistic egocentric video data, highlighting current models' limitations in weakly-aligned, real-world scenarios.

Contribution

It presents a new benchmark suite and evaluation pipeline for models trained on egocentric videos, emphasizing the gap between current models and human-like language grounding in naturalistic settings.

Findings

01 Current VLMs rely on tightly aligned data and struggle with weakly-aligned egocentric input.

02 The Machine-DevBench benchmark measures lexical and grammatical competence across frequency bins.

03 Models trained on naturalistic data underperform compared to those trained on curated datasets.
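
The second finding mentions scoring competence across frequency bins. As a rough illustration of what per-bin scoring can look like, here is a minimal sketch that groups test items by the order of magnitude of a word's corpus count and reports accuracy per group. The function name, the input format, and the log10 binning scheme are all assumptions made for illustration; the summary above does not specify how Machine-DevBench defines its bins or scores its items.

```python
from collections import defaultdict


def accuracy_by_frequency_bin(results):
    """Hypothetical helper: accuracy per order-of-magnitude frequency bin.

    `results` is an iterable of (corpus_count, is_correct) pairs, where
    corpus_count is a positive integer. Bin k covers counts in
    [10**k, 10**(k + 1)). This binning is an assumption, not the
    benchmark's actual definition.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for count, is_correct in results:
        if not isinstance(count, int) or count < 1:
            raise ValueError("corpus counts must be positive integers")
        # Number of decimal digits minus one gives floor(log10(count)) exactly.
        k = len(str(count)) - 1
        totals[k] += 1
        hits[k] += int(bool(is_correct))
    return {k: hits[k] / totals[k] for k in sorted(totals)}


if __name__ == "__main__":
    # Made-up toy data, not results from the paper.
    toy = [(5, True), (7, False), (42, True), (99, True), (1500, False)]
    print(accuracy_by_frequency_bin(toy))
```

Reporting accuracy per bin rather than as a single average makes it visible whether a model only handles frequent words, which is presumably the motivation for binning in the first place.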

Abstract

Children acquire language grounding with remarkable robustness from limited visuo-linguistic input in ways that surpass today's best large multimodal models. Recent research suggests current vision-language models (VLMs) trained on curated web data fail to generalize to the sparse, weakly-aligned egocentric streams produced by wearable devices, embodied agents, and infant head-cams -- and no fixed evaluation pipeline exists for measuring progress on this regime. We train VLMs on datasets with varying degrees of semantic alignment between visual and linguistic inputs, including naturalistic infant and adult egocentric videos, and evaluate them with a comprehensive suite spanning multimodal language grounding and unimodal vision and language tasks. At the core of this suite is Machine-DevBench, a corpus-grounded benchmark of lexical and grammatical competence, automatically generated from…
