Thinking Hallucination for Video Captioning

Nasib Ullah; Partha Pratim Mohanta

arXiv:2209.13853 · cs.CV · September 29, 2022


TL;DR

This paper investigates the root causes of hallucination in video captioning models, proposes auxiliary heads and context gates to mitigate them, and introduces a new metric, COAHA, to evaluate hallucination. The authors report state-of-the-art results on MSR-VTT and MSVD.

Contribution

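The paper identifies three factors that cause hallucination in video captioning (inadequate visual features from pre-trained models, improper influence of source and target contexts during multi-modal fusion, and exposure bias in training), proposes methods to mitigate them, and introduces a new evaluation metric for hallucination.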

Findings

01. Proposed auxiliary heads improve visual feature robustness.
02. Context gates enhance feature fusion during captioning.
03. Achieved state-of-the-art results on MSR-VTT and MSVD datasets.
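To make the context-gate finding more concrete, the following is a minimal sketch of a generic element-wise gate that blends a visual (source) context vector with a language (target) context vector. It is an illustration under stated assumptions, not the paper's architecture: the function name `context_gate`, the single linear layer, the vector size, and the random weights are all hypothetical, and the paper's actual formulation may differ.

```python
import math
import random


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def context_gate(source_ctx, target_ctx, weights, bias):
    """Blend source (visual) and target (language) contexts element-wise.

    Hypothetical formulation, assumed for illustration only:
        gate_i  = sigmoid(weights[i] . [source; target] + bias[i])
        fused_i = gate_i * source_i + (1 - gate_i) * target_i
    """
    joint = source_ctx + target_ctx  # list concatenation: [source; target]
    fused, gates = [], []
    for i, (s, t) in enumerate(zip(source_ctx, target_ctx)):
        z = sum(w * x for w, x in zip(weights[i], joint)) + bias[i]
        g = sigmoid(z)
        gates.append(g)
        fused.append(g * s + (1.0 - g) * t)
    return fused, gates


# Toy usage with random vectors standing in for learned features.
rng = random.Random(0)
dim = 4
source = [rng.uniform(-1.0, 1.0) for _ in range(dim)]  # e.g. attended video feature
target = [rng.uniform(-1.0, 1.0) for _ in range(dim)]  # e.g. decoder language state
weights = [[rng.gauss(0.0, 0.1) for _ in range(2 * dim)] for _ in range(dim)]
bias = [0.0] * dim

fused, gates = context_gate(source, target, weights, bias)
print("gates:", [round(g, 3) for g in gates])
print("fused:", [round(f, 3) for f in fused])
```

A gate value near 1 lets the visual context dominate that dimension, while a value near 0 favours the language context. In a trained model the weights would be learned jointly with the captioner, so the balance could shift from word to word.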

Abstract

With the advent of rich visual representations and pre-trained language models, video captioning has seen continuous improvement over time. Despite the performance improvement, video captioning models are prone to hallucination. Hallucination refers to the generation of highly pathological descriptions that are detached from the source material. In video captioning, there are two kinds of hallucination: object and action hallucination. Instead of endeavoring to learn better representations of a video, in this work, we investigate the fundamental sources of the hallucination problem. We identify three main factors: (i) inadequate visual features extracted from pre-trained models, (ii) improper influences of source and target contexts during multi-modal fusion, and (iii) exposure bias in the training strategy. To alleviate these problems, we propose two robust solutions: (a) the…


Code & Models

Repositories

nasib-ullah/THVC (PyTorch, official)


Taxonomy

Topics: Multimodal Machine Learning Applications