EgoAVU: Egocentric Audio-Visual Understanding

Ashish Seth; Xinhao Mei; Changsheng Zhao; Varun Nagaraja; Ernie Chang; Gregory P. Meyer; Gael Le Lan; Yunyang Xiong; Vikas Chandra; Yangyang Shi; Dinesh Manocha; Zhipeng Cai

arXiv:2602.06139·cs.CV·February 9, 2026

TL;DR

EgoAVU introduces a scalable method for automatically generating egocentric audio-visual data, enabling large-scale training and revealing limitations of current models in understanding both modalities jointly.

Contribution

The paper presents EgoAVU, a novel data engine and dataset for egocentric audio-visual understanding, improving multi-modal model performance and exposing existing biases.

Findings

01

Finetuning on EgoAVU-Instruct improves performance by up to 113% on EgoAVU-Bench.

02

Existing models are biased towards visual signals and tend to neglect audio cues.

03

Transfer gains are also observed on the EgoTempo and EgoIllusion benchmarks.
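
As a reading aid for the "up to 113%" figure in the first finding, the sketch below shows how a relative improvement of that size is computed. It assumes the figure is a relative gain over the baseline score, which the summary does not state explicitly. The function name `relative_improvement` and both scores are hypothetical, chosen only to illustrate the arithmetic; they are not results from the paper.

```python
def relative_improvement(baseline: float, finetuned: float) -> float:
    """Return the gain of `finetuned` over `baseline` as a percentage of `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (finetuned - baseline) / baseline * 100.0


# Hypothetical benchmark scores (NOT taken from the paper), picked so the
# arithmetic lands near the headline figure.
baseline_score = 20.0
finetuned_score = 42.6

gain = relative_improvement(baseline_score, finetuned_score)
print(f"{gain:.0f}% relative improvement")
```

Under this reading, a gain above 100% means the finetuned score is more than double the baseline, which is only possible when the baseline is well below half of the maximum score.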

Abstract

Understanding egocentric videos plays a vital role in embodied intelligence. Recent multi-modal large language models (MLLMs) can accept both visual and audio inputs. However, because text labels with coherent joint-modality information are difficult to obtain, whether MLLMs can jointly understand both modalities in egocentric videos remains under-explored. To address this problem, we introduce EgoAVU, a scalable data engine that automatically generates egocentric audio-visual narrations, questions, and answers. EgoAVU enriches human narrations with multimodal context and generates audio-visual narrations through cross-modal correlation modeling. Token-based video filtering and modular, graph-based curation ensure both data diversity and quality. Leveraging EgoAVU, we construct EgoAVU-Instruct, a large-scale training dataset of 3M samples, and EgoAVU-Bench, a manually verified…

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Music and Audio Processing