Thinking Hallucination for Video Captioning
Nasib Ullah, Partha Pratim Mohanta

TL;DR
This paper investigates the root causes of hallucination in video captioning models, proposes auxiliary heads and context gates to mitigate them, and introduces a new metric, COAHA, for evaluating hallucination. The approach achieves state-of-the-art results on MSR-VTT and MSVD.
Contribution
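The paper identifies three factors that cause hallucination in video captioning (inadequate visual features, improper influence of source and target contexts during multi-modal fusion, and exposure bias), proposes methods to mitigate them, and introduces a new evaluation metric for hallucination.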
Findings
Proposed auxiliary heads improve visual feature robustness.
Context gates enhance feature fusion during captioning.
The method achieves state-of-the-art results on the MSR-VTT and MSVD datasets.
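
To make the context-gate finding concrete, here is a minimal sketch of a generic context gate in plain Python. It is not taken from the paper: the function name `context_gate`, the parameters `W_s`, `W_t`, and `bias`, and the specific gating formula (a sigmoid gate that interpolates element-wise between a source/visual context and a target/language context) are assumptions made for illustration, and the authors' formulation may differ.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def context_gate(source_ctx, target_ctx, W_s, W_t, bias):
    """Fuse two context vectors with an element-wise gate (illustrative only).

    source_ctx: visual (source) context vector, length d
    target_ctx: language (target) context vector, length d
    W_s, W_t:   hypothetical d x d gate weight matrices
    bias:       hypothetical gate bias, length d

    Returns (fused, gate), where each gate value z lies in (0, 1) and
    fused[i] = z[i] * source_ctx[i] + (1 - z[i]) * target_ctx[i].
    """
    s_proj = matvec(W_s, source_ctx)
    t_proj = matvec(W_t, target_ctx)
    gate = [sigmoid(a + b + c) for a, b, c in zip(s_proj, t_proj, bias)]
    fused = [z * s + (1.0 - z) * t
             for z, s, t in zip(gate, source_ctx, target_ctx)]
    return fused, gate


# Usage: with all-zero gate parameters the gate is 0.5 everywhere,
# so the fused vector is the plain average of the two contexts.
zeros = [[0.0, 0.0], [0.0, 0.0]]
fused, gate = context_gate([1.0, 2.0], [3.0, 4.0], zeros, zeros, [0.0, 0.0])
print(fused)  # -> [2.0, 3.0]
```

In a trained model the gate parameters would be learned, so the decoder could lean on the visual context when a word must be grounded in the video and on the language context otherwise. The sketch shows only the fusion arithmetic, not training.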
Abstract
With the advent of rich visual representations and pre-trained language models, video captioning has seen continuous improvement over time. Despite the performance improvement, video captioning models are prone to hallucination. Hallucination refers to the generation of highly pathological descriptions that are detached from the source material. In video captioning, there are two kinds of hallucination: object and action hallucination. Instead of endeavoring to learn better representations of a video, in this work, we investigate the fundamental sources of the hallucination problem. We identify three main factors: (i) inadequate visual features extracted from pre-trained models, (ii) improper influences of source and target contexts during multi-modal fusion, and (iii) exposure bias in the training strategy. To alleviate these problems, we propose two robust solutions: (a) the…
Taxonomy
Topics: Multimodal Machine Learning Applications
