EgoAVU: Egocentric Audio-Visual Understanding
Ashish Seth, Xinhao Mei, Changsheng Zhao, Varun Nagaraja, Ernie Chang, Gregory P. Meyer, Gael Le Lan, Yunyang Xiong, Vikas Chandra, Yangyang Shi, Dinesh Manocha, Zhipeng Cai

TL;DR
EgoAVU introduces a scalable method for automatically generating egocentric audio-visual data, enabling large-scale training and revealing limitations of current models in understanding both modalities jointly.
Contribution
The paper presents EgoAVU, a novel data engine and dataset for egocentric audio-visual understanding, improving multi-modal model performance and exposing existing biases.
Findings
Finetuning on EgoAVU-Instruct improves performance by up to 113% on EgoAVU-Bench.
Existing models are biased towards visual signals and neglect audio cues.
Transfer gains are also observed on the EgoTempo and EgoIllusion benchmarks.
Abstract
Understanding egocentric videos plays a vital role in embodied intelligence. Recent multi-modal large language models (MLLMs) can accept both visual and audio inputs. However, due to the challenge of obtaining text labels with coherent joint-modality information, whether MLLMs can jointly understand both modalities in egocentric videos remains under-explored. To address this problem, we introduce EgoAVU, a scalable data engine to automatically generate egocentric audio-visual narrations, questions, and answers. EgoAVU enriches human narrations with multimodal context and generates audio-visual narrations through cross-modal correlation modeling. Token-based video filtering and modular, graph-based curation ensure both data diversity and quality. Leveraging EgoAVU, we construct EgoAVU-Instruct, a large-scale training dataset of 3M samples, and EgoAVU-Bench, a manually verified…
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Music and Audio Processing
