Combating Multimodal LLM Hallucination via Bottom-Up Holistic Reasoning
Shengqiong Wu, Hao Fei, Liangming Pan, William Yang Wang, Shuicheng Yan, Tat-Seng Chua

TL;DR
This paper introduces a bottom-up reasoning framework for multimodal large language models to reduce hallucinations by verifying visual and textual inputs with commonsense knowledge, leading to more reliable outputs.
Contribution
It proposes a novel holistic reasoning approach that combines perception and cognition-level verification to effectively combat hallucinations in MLLMs.
Findings
Significant improvements on hallucination benchmarks
Enhanced reliability of multimodal outputs
Effective handling of perception- and cognition-level hallucinations
Abstract
Recent advancements in multimodal large language models (MLLMs) have shown unprecedented capabilities across various vision-language tasks. However, MLLMs face significant challenges with hallucinations, i.e., misleading outputs that do not align with the input data. Although existing efforts have sought to combat MLLM hallucinations, several pivotal challenges remain unsolved. First, while current approaches focus heavily on addressing errors at the perception level, another important type at the cognition level, which requires factual commonsense, can be overlooked. In addition, existing methods may fall short in finding a more effective way to represent visual input, which remains a key bottleneck that triggers visual hallucinations. Moreover, MLLMs can frequently be misled by faulty textual inputs and produce hallucinations, while unfortunately, this type of issue has long been…
Taxonomy
Topics: Hallucinations in medical conditions
Methods: Focus · ALIGN
