Less is More: Mitigating Multimodal Hallucination from an EOS Decision Perspective
Zihao Yue, Liang Zhang, Qin Jin

TL;DR
This paper investigates how large multimodal models decide when to stop generating content and proposes methods to reduce hallucinations by improving EOS decision-making and filtering training data.
Contribution
It introduces a new perspective on multimodal hallucinations focusing on EOS decision processes and offers two effective mitigation strategies without extra data.
Findings
A new training objective reduces multimodal hallucinations
A data filtering strategy keeps hallucination-inducing samples out of the training data
The model decides when to end generation by comparing the generated text with the image
Abstract
Large Multimodal Models (LMMs) often suffer from multimodal hallucinations, wherein they may create content that is not present in the visual inputs. In this paper, we explore a new angle of this issue: overly detailed training data hinders the model's ability to terminate generation in a timely manner, leading to continued outputs beyond visual perception limits. By investigating how the model decides to terminate generation with EOS, the special end-of-sentence token, we find that the model assesses the completeness of the entire sequence by comparing the generated text with the image. This observation suggests that the model has an inherent potential for making proper EOS decisions based on its visual perception to avoid overly lengthy outputs. To take advantage of such potential, we explore two methods to mitigate multimodal hallucinations: a training objective that enables the model to…
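To make the notion of an "EOS decision" concrete, here is a minimal sketch of how one might trace the probability a decoder assigns to the EOS token at each generation step. It is an illustration only, not the paper's method: the function names, the toy logits, and the 0.5 threshold are all assumptions introduced here.

```python
import math


def eos_probability(logits, eos_id):
    """Softmax probability of the EOS token for one decoding step.

    `logits` is a list of raw scores over the vocabulary and `eos_id`
    is the index of the EOS token (both hypothetical inputs).
    """
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return exps[eos_id] / sum(exps)


def eos_trace(step_logits, eos_id):
    """EOS probability at every decoding step, in order."""
    return [eos_probability(logits, eos_id) for logits in step_logits]


def first_stop_step(step_logits, eos_id, threshold=0.5):
    """Index of the first step whose EOS probability reaches `threshold`.

    Returns None if the model never becomes that confident about stopping.
    The threshold is an arbitrary illustrative value.
    """
    for step, p in enumerate(eos_trace(step_logits, eos_id)):
        if p >= threshold:
            return step
    return None


# Toy example: a 3-token vocabulary where index 2 plays the role of EOS.
toy_logits = [
    [2.0, 1.0, -3.0],  # EOS very unlikely
    [1.0, 1.0, 1.0],   # EOS at chance level (1/3)
    [0.0, 0.0, 4.0],   # EOS dominant
]
stop_at = first_stop_step(toy_logits, eos_id=2)
```

A trace like this, collected over real model outputs, is one way to see whether the tendency to stop rises as the generated text covers more of what is visible in the image.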
Taxonomy
Topics: Hallucinations in medical conditions · Mental Health and Psychiatry
