Hallucination Mitigation Prompts Long-term Video Understanding
Yiwei Sun, Zhihang Liu, Chuanbin Liu, Bowei Pu, Zhihan Zhang, Hongtao Xie

TL;DR
This paper introduces a hallucination mitigation pipeline for long video understanding using multimodal large language models, improving accuracy and reducing hallucinations in long video question answering tasks.
Contribution
It presents a novel pipeline combining frame sampling, question-guided visual feature extraction, and answer generation techniques to mitigate hallucinations in long video understanding.
Findings
Achieved 84.2% accuracy on the MovieChat dataset
Surpassed baseline by 29.1% in global mode
Won third place in CVPR LOVEU 2024 challenge
Abstract
Recently, multimodal large language models have made significant advancements in video understanding tasks. However, their ability to understand unprocessed long videos is very limited, primarily due to the difficulty in supporting the enormous memory overhead. Although existing methods achieve a balance between memory and information by aggregating frames, they inevitably introduce the severe hallucination issue. To address this issue, this paper constructs a comprehensive hallucination mitigation pipeline based on existing MLLMs. Specifically, we use the CLIP Score to guide the frame sampling process with questions, selecting key frames relevant to the question. Then, we inject question information into the queries of the image Q-former to obtain more important visual features. Finally, during the answer generation stage, we utilize chain-of-thought and in-context learning techniques…
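The question-guided frame sampling step described in the abstract can be illustrated with a minimal sketch. It assumes frame and question embeddings have already been computed by a CLIP-style encoder and uses plain cosine similarity as a stand-in for the CLIP Score. The function name select_key_frames, the parameter k, and the toy vectors are hypothetical and do not come from the paper.

```python
import numpy as np

def select_key_frames(frame_embs, question_emb, k):
    """Pick the k frames whose embeddings are most similar to the question.

    frame_embs:   (num_frames, dim) array of precomputed image embeddings (assumed).
    question_emb: (dim,) array holding the precomputed text embedding (assumed).
    Returns the selected frame indices in temporal order and all similarity scores.
    """
    # L2-normalise so the dot product equals cosine similarity.
    frames = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    question = question_emb / np.linalg.norm(question_emb)
    scores = frames @ question
    k = min(k, len(scores))
    # Highest-scoring frames first, then restore temporal order.
    top = np.argsort(-scores, kind="stable")[:k]
    return np.sort(top), scores

# Toy example: four 2-D "frame embeddings" and one "question embedding".
toy_frames = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]])
toy_question = np.array([1.0, 0.0])
key_idx, sims = select_key_frames(toy_frames, toy_question, k=2)
```

Here the first and third frames point in nearly the same direction as the question vector, so they are the two frames kept. Returning the indices in temporal order is a design choice of this sketch, made so that a downstream model would see the selected frames in their original sequence.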
Taxonomy
Topics: Psychedelics and Drug Studies · Hallucinations in medical conditions · Psychosomatic Disorders and Their Treatments
Methods: Contrastive Language-Image Pre-training
