TL;DR
This paper investigates hallucination issues in Multimodal Chain-of-Thought models, identifies their causes, and proposes an effective intervention strategy that improves visual reasoning accuracy.
Contribution
It systematically analyzes hallucination patterns in MCoT models and introduces a simple method to localize and mitigate hallucinations induced by divergent thinking.
Findings
The method significantly reduces hallucinations in MCoT models.
It outperforms existing hallucination mitigation techniques.
The approach can be combined with other methods for further improvements.
Abstract
Multimodal Chain-of-Thought (MCoT) models have demonstrated impressive capability in complex visual reasoning tasks. Unfortunately, recent studies reveal that they suffer from severe hallucination problems due to diminished visual attention during the generation process. However, visual attention decay is a well-studied problem in Large Vision-Language Models (LVLMs). Considering the fundamental differences in reasoning processes between MCoT models and traditional LVLMs, we raise a basic question: do MCoT models have unique causes of hallucination? To answer this question, we systematically investigate the hallucination patterns of MCoT models and find that fabricated text is primarily generated in associative reasoning steps, which we term divergent thinking. Leveraging these insights, we introduce a simple yet effective strategy that can localize divergent…
