Hallucination Localization in Video Captioning
Shota Nakada, Kazuhiro Saito, Yuchi Ishikawa, Hokuto Munakata, Tatsuya Komatsu, Masayoshi Kondo

TL;DR
This paper introduces a new task of hallucination localization in video captioning, providing a detailed span-level analysis and a benchmark dataset to evaluate current models' ability to identify hallucinations.
Contribution
It proposes a novel span-level hallucination localization task for video captions, constructs HLVC-Dataset from 1,167 manually annotated video-caption pairs, and benchmarks a VideoLLM-based baseline on the new task.
Findings
Baseline methods show room for improvement in hallucination localization.
HLVC-Dataset enables detailed evaluation of hallucination detection.
Quantitative and qualitative analyses highlight current challenges.
Abstract
We propose a novel task, hallucination localization in video captioning, which aims to identify hallucinations in video captions at the span level (i.e., individual words or phrases). This allows for a more detailed analysis of hallucinations than the existing sentence-level hallucination detection task. To establish a benchmark for hallucination localization, we construct HLVC-Dataset, a carefully curated dataset created by manually annotating 1,167 video-caption pairs from VideoLLM-generated captions. We further implement a VideoLLM-based baseline method and conduct quantitative and qualitative evaluations to benchmark current performance on hallucination localization.
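To make the span-level framing concrete, the sketch below represents hallucinated spans as character offsets into a caption and scores a prediction against a gold annotation by character overlap. This is an illustration only: the caption, the spans, the helper names (`find_span`, `spans_to_mask`, `span_prf`), and the choice of character-level precision/recall/F1 are all assumptions, not the annotation format or the metric used for HLVC-Dataset.

```python
from typing import List, Tuple

# A span is a half-open [start, end) range of character offsets in the caption.
Span = Tuple[int, int]


def find_span(caption: str, phrase: str) -> Span:
    """Return the offsets of the first occurrence of `phrase` in `caption`."""
    start = caption.index(phrase)  # raises ValueError if the phrase is absent
    return (start, start + len(phrase))


def spans_to_mask(spans: List[Span], length: int) -> List[bool]:
    """Mark every character position covered by at least one span."""
    mask = [False] * length
    for start, end in spans:
        for i in range(max(0, start), min(end, length)):
            mask[i] = True
    return mask


def span_prf(gold: List[Span], pred: List[Span], length: int) -> Tuple[float, float, float]:
    """Character-overlap precision, recall, and F1 (a hypothetical metric choice)."""
    g = spans_to_mask(gold, length)
    p = spans_to_mask(pred, length)
    tp = sum(1 for a, b in zip(g, p) if a and b)
    n_pred, n_gold = sum(p), sum(g)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up example: suppose the video shows a man in a blue shirt playing guitar.
caption = "A man in a red shirt plays guitar on a beach."
gold = [find_span(caption, "red shirt")]                        # hallucinated phrase
pred = [find_span(caption, "red"), find_span(caption, "guitar")]  # model output

precision, recall, f1 = span_prf(gold, pred, len(caption))
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

A sentence-level detector would only flag this caption as containing a hallucination; the span-level view also records which words are unsupported, so a partially correct prediction like the one above receives partial credit.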
