Piculet: Specialized Models-Guided Hallucination Decrease for MultiModal Large Language Models
Kohou Wang, Xiang Liu, Zhaoxiang Liu, Kai Wang, Shiguo Lian

TL;DR
Piculet is a training-free approach that reduces hallucinations in multimodal large language models by using specialized models to extract and incorporate detailed visual descriptions into the input, improving alignment without retraining.
Contribution
Introducing Piculet, a universal, training-free method that enhances MLLMs' input with visual descriptions from specialized models to decrease hallucinations.
Findings
Significantly reduces hallucinations in MLLMs.
Effective across different MLLMs without retraining.
Improves alignment between image content and generated text.
Abstract
Multimodal Large Language Models (MLLMs) have made significant progress in bridging the gap between visual and language modalities. However, hallucinations in MLLMs, where the generated text does not align with image content, continue to be a major challenge. Existing methods for addressing hallucinations often rely on instruction-tuning, which requires retraining the model with specific data and further increases the cost of using MLLMs. In this paper, we introduce a novel training-free method, named Piculet, for enhancing the input representation of MLLMs. Piculet leverages multiple specialized models to extract descriptions of visual information from the input image and combines these descriptions with the original image and query as input to the MLLM. We evaluate our method both quantitatively and qualitatively, and the results demonstrate that Piculet greatly decreases…
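The input-augmentation step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say which specialized models are used or how their outputs are formatted, so the `Specialist` type, the `build_augmented_input` function, the stub specialists, and the prompt layout are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical interface: a specialist maps an image path to short text descriptions.
Specialist = Callable[[str], List[str]]


@dataclass
class AugmentedInput:
    """What would be handed to the MLLM: the original image plus an enriched prompt."""
    image_path: str
    prompt: str


def build_augmented_input(
    image_path: str, query: str, specialists: Dict[str, Specialist]
) -> AugmentedInput:
    """Run each specialist on the image and prepend its descriptions to the query."""
    sections = []
    for name, run in specialists.items():
        facts = run(image_path)
        if facts:  # skip specialists that found nothing
            sections.append(f"{name}: " + "; ".join(facts))

    if sections:
        context = "\n".join(sections)
        prompt = (
            "Visual descriptions from specialized models:\n"
            f"{context}\n\nQuestion: {query}"
        )
    else:
        prompt = f"Question: {query}"
    return AugmentedInput(image_path=image_path, prompt=prompt)


# Stub specialists standing in for real models (assumed, for illustration only).
def fake_detector(image_path: str) -> List[str]:
    return ["2 dogs", "1 frisbee"]


def fake_ocr(image_path: str) -> List[str]:
    return []


if __name__ == "__main__":
    result = build_augmented_input(
        "img.jpg", "How many dogs?", {"objects": fake_detector, "text": fake_ocr}
    )
    print(result.prompt)
```

The MLLM itself is left untouched in this sketch, which matches the training-free framing: only the text that accompanies the image changes.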
Taxonomy
Topics: Topic Modeling
Methods: ALIGN
