EgoSound: Benchmarking Sound Understanding in Egocentric Videos

Bingwen Zhu; Yuqian Fu; Qiaole Dong; Guolei Sun; Tianwen Qian; Yuzheng Wu; Danda Pani Paudel; Xiangyang Xue; and Yanwei Fu

arXiv:2602.14122 · cs.CV · April 21, 2026

TL;DR

EgoSound is a new benchmark designed to evaluate egocentric sound understanding in multimodal large language models, highlighting current models' strengths and limitations in multisensory perception.

Contribution

The paper introduces the first comprehensive benchmark for egocentric sound understanding, combining data from Ego4D and EgoBlind with a seven-task taxonomy and extensive QA pairs.

Findings

01. Current models show emerging auditory reasoning abilities.

02. Models are limited in fine-grained spatial and causal understanding.

03. EgoSound provides a challenging foundation for multisensory egocentric intelligence.

Abstract

Multimodal Large Language Models (MLLMs) have recently achieved remarkable progress in vision-language understanding. Yet, human perception is inherently multisensory, integrating sight, sound, and motion to reason about the world. Among these modalities, sound provides indispensable cues about spatial layout, off-screen events, and causal interactions, particularly in egocentric settings where auditory and visual signals are tightly coupled. To this end, we introduce EgoSound, the first benchmark designed to systematically evaluate egocentric sound understanding in MLLMs. EgoSound unifies data from Ego4D and EgoBlind, encompassing both sighted and sound-dependent experiences. It defines a seven-task taxonomy spanning intrinsic sound perception, spatial localization, causal inference, and cross-modal reasoning. Constructed through a multi-stage auto-generative pipeline, EgoSound…

Code & Models

Repositories

https://groolegend.github.io/EgoSound
github

Datasets

grooLegend/EgoSound
dataset · 227 dl
