VidHal: Benchmarking Temporal Hallucinations in Vision LLMs
Wey Yeh Choong, Yangyang Guo, Mohan Kankanhalli

TL;DR
VidHal is a new benchmark designed to evaluate and analyze video-based hallucinations in Vision Large Language Models, highlighting their limitations and guiding future improvements.
Contribution
The paper introduces VidHal, a novel benchmark with a caption ordering task to assess hallucinations in VLLMs on videos, addressing limitations of existing evaluation methods.
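To make the caption ordering task concrete, here is a minimal sketch of how a predicted ranking could be scored against a reference ranking. The summary above does not say which metric VidHal uses, so the pairwise-agreement score, the function name `pairwise_order_agreement`, and the caption IDs "A", "B", "C" are illustrative assumptions and not the authors' method.

```python
from itertools import combinations


def pairwise_order_agreement(predicted, reference):
    """Fraction of caption pairs ranked in the same relative order as the reference.

    Hypothetical scoring function for illustration only; it is not taken
    from the VidHal paper. Both arguments are lists of unique caption IDs,
    ordered from least to most hallucinated.
    """
    if sorted(predicted) != sorted(reference):
        raise ValueError("predicted and reference must contain the same caption IDs")
    if len(reference) < 2:
        raise ValueError("need at least two captions to compare orderings")

    # Map each caption ID to its position in the model's predicted ranking.
    position = {caption_id: rank for rank, caption_id in enumerate(predicted)}

    # combinations() yields every pair (a, b) with a before b in the reference.
    pairs = list(combinations(reference, 2))
    agree = sum(1 for a, b in pairs if position[a] < position[b])
    return agree / len(pairs)


# Assumed example: three captions with made-up IDs.
reference = ["A", "B", "C"]   # reference order, least to most hallucinated
predicted = ["A", "C", "B"]   # model swaps the last two captions
score = pairwise_order_agreement(predicted, reference)
```

Under this assumed metric, a model that reproduces the reference order exactly scores 1.0 and a fully reversed order scores 0.0, so partially correct rankings get partial credit instead of being marked simply wrong.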
Findings
Existing VLLMs show significant hallucination issues on videos.
VidHal reveals models' limitations in handling spatiotemporal information.
The benchmark encourages the development of more accurate VLLMs for video understanding.
Abstract
Vision Large Language Models (VLLMs) are widely acknowledged to be prone to hallucinations. Existing research addressing this problem has primarily been confined to image inputs, with limited exploration of video-based hallucinations. Furthermore, current evaluation methods fail to capture nuanced errors in generated responses, which are often exacerbated by the rich spatiotemporal dynamics of videos. To address this, we introduce VidHal, a benchmark specially designed to evaluate video-based hallucinations in VLLMs. VidHal is constructed by bootstrapping video instances across a wide range of common temporal aspects. A defining feature of our benchmark lies in the careful creation of captions which represent varying levels of hallucination associated with each video. To enable fine-grained evaluation, we propose a novel caption ordering task requiring VLLMs to rank captions by…
