Exploring Audio Hallucination in Egocentric Video Understanding

Ashish Seth; Xinhao Mei; Changsheng Zhao; Varun Nagaraja; Ernie Chang; Gregory P. Meyer; Gael Le Lan; Yunyang Xiong; Vikas Chandra; Yangyang Shi; Dinesh Manocha; Zhipeng Cai

arXiv:2604.23860·cs.CV·April 28, 2026

TL;DR

This paper investigates audio hallucinations in egocentric video understanding, showing that current AV-LLMs often report sounds that are suggested by the visuals but not actually heard, and introduces an evaluation framework and dataset to analyze and quantify these hallucinations.

Contribution

The paper presents a systematic evaluation framework, a curated dataset of 300 egocentric videos with 1,000 sound-focused questions, and a taxonomy for analyzing audio hallucinations, highlighting how unreliable current AV-LLMs are when asked about sound.

Findings

01

Advanced AV-LLMs like Qwen2.5 Omni have high hallucination rates.

02

Models achieve only 27.3% accuracy on foreground sound questions.

03

Models achieve only 39.5% accuracy on background sound questions.

Abstract

Egocentric videos provide a distinctive setting in which sound serves as a crucial cue for understanding user activities and surroundings, particularly when visual information is unstable or occluded due to continuous camera movement. State-of-the-art large audio-visual language models (AV-LLMs) can generate multimodal descriptions. However, we show in this work that they are prone to audio hallucinations, often inferring sounds from visual cues that are visible but not heard. We present a systematic and automatic evaluation framework for analyzing audio hallucinations in egocentric video through a targeted question-answering (Q/A) protocol. We curate a dataset of 300 egocentric videos and design 1,000 sound-focused questions to probe model outputs. To characterize hallucinations, we propose a grounded taxonomy that distinguishes between foreground action sounds from the user activities and…
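
To make the Q/A protocol concrete, here is a minimal sketch of how per-category accuracy could be computed from model answers. It is not the paper's implementation: the record fields (`category`, `expected`, `predicted`), the exact-match scoring rule, and the toy records are all assumptions made for illustration, and the toy values are not data from the paper.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return {category: accuracy} for a list of Q/A records.

    Each record is a dict with hypothetical keys:
      'category'  - e.g. 'foreground' or 'background'
      'expected'  - the reference answer
      'predicted' - the model's answer
    Scoring is a case-insensitive exact match (an assumption).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        category = record["category"]
        total[category] += 1
        expected = record["expected"].strip().lower()
        predicted = record["predicted"].strip().lower()
        if predicted == expected:
            correct[category] += 1
    return {category: correct[category] / total[category] for category in total}


# Toy records, invented for illustration only.
toy_records = [
    {"category": "foreground", "expected": "no", "predicted": "Yes"},
    {"category": "foreground", "expected": "yes", "predicted": "yes"},
    {"category": "background", "expected": "no", "predicted": "no"},
    {"category": "background", "expected": "yes", "predicted": " Yes "},
]

scores = accuracy_by_category(toy_records)
for category, score in scores.items():
    print(f"{category}: {score:.1%}")
```

Reporting accuracy separately per category, rather than as one pooled number, is what lets foreground and background results be compared in the way the findings above do.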
