Cure or Poison? Embedding Instructions Visually Alters Hallucination in Vision-Language Models
Zhaochen Wang, Yiwei Wang, Yujun Cai

TL;DR
This paper introduces Prompt-in-Image, a method that embeds textual instructions directly into images. It improves accuracy and reduces hallucination for Qwen2.5-VL, but severely degrades LLaVA-1.5 and InstructBLIP.
Contribution
The paper presents Prompt-in-Image, a simple method that embeds textual instructions directly into the image, so that a vision-language model receives no separate text input and processes all content through the visual channel.
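As a rough illustration of what embedding an instruction into an image can look like, the sketch below renders a prompt into a white band appended under the picture using Pillow. The function name `embed_prompt`, the band placement, the default font, and the fixed `line_height` and `char_width` values are all assumptions made for this example; the summary does not say how the authors lay out or render the text.

```python
import textwrap

from PIL import Image, ImageDraw, ImageFont


def embed_prompt(image, prompt, line_height=14, char_width=6, margin=4):
    """Return a new image with `prompt` drawn in a white band below `image`.

    Hypothetical helper: the layout (band under the image, default bitmap
    font, fixed per-character width) is an assumption, not the paper's recipe.
    """
    image = image.convert("RGB")
    width, height = image.size

    # Wrap the prompt so each line roughly fits the image width.
    chars_per_line = max(1, (width - 2 * margin) // char_width)
    lines = textwrap.wrap(prompt, width=chars_per_line) or [""]

    # Extend the canvas downward to make room for the text.
    band_height = len(lines) * line_height + 2 * margin
    canvas = Image.new("RGB", (width, height + band_height), "white")
    canvas.paste(image, (0, 0))

    draw = ImageDraw.Draw(canvas)
    font = ImageFont.load_default()
    for i, line in enumerate(lines):
        y = height + margin + i * line_height
        draw.text((margin, y), line, fill="black", font=font)
    return canvas


if __name__ == "__main__":
    # Stand-in for a real photo: a plain gray 224x224 image.
    photo = Image.new("RGB", (224, 224), (128, 128, 128))
    combined = embed_prompt(photo, "Is there a dog in the image? Answer yes or no.")
    print(combined.size)
```

The combined image would then be passed to the model as its only input. Font size, contrast, and text placement are likely to matter in practice and are left as free parameters here.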
Findings
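Improves Qwen2.5-VL's POPE accuracy by 4.1 percentage points (from 80.2% to 84.3%) and reduces its hallucination rates on MS-COCO.
Causes a severe performance drop in LLaVA-1.5 and InstructBLIP, with accuracy falling from around 84% to near-random levels.
Reduces the modality gap in Qwen, enhancing cross-modal alignment.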
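The accuracy figures above come from POPE, which asks yes/no questions about whether an object is present in the image. The following is a minimal sketch of how such an accuracy and the reported gain could be computed. The `accuracy` helper and the toy answers are illustrative assumptions; only the 80.2 and 84.3 figures are taken from the abstract.

```python
def accuracy(predictions, labels):
    """Fraction of yes/no predictions that match the ground-truth labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    correct = sum(p.strip().lower() == l.strip().lower()
                  for p, l in zip(predictions, labels))
    return correct / len(labels)


# Toy example (made-up answers, not data from the paper).
toy_preds = ["yes", "no", "yes", "no"]
toy_labels = ["yes", "no", "no", "no"]
toy_acc = accuracy(toy_preds, toy_labels)

# Reported Qwen2.5-VL POPE accuracy, in percent, from the abstract.
baseline_pct = 80.2
prompt_in_image_pct = 84.3
gain_points = round(prompt_in_image_pct - baseline_pct, 1)  # percentage points
```

Because POPE answers are binary, the near-random accuracy reported for LLaVA-1.5 and InstructBLIP corresponds to roughly 50% on a balanced question set.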
Abstract
Vision-Language Models (VLMs) often suffer from hallucination, partly due to challenges in aligning multimodal information. We propose Prompt-in-Image, a simple method that embeds textual instructions directly into images. This removes the need for separate text inputs and forces the model to process all content through the visual channel. We evaluate this method on three popular open-source VLMs: Qwen2.5-VL, LLaVA-1.5, and InstructBLIP. The results reveal sharp differences. Prompt-in-Image improves Qwen2.5-VL's performance, increasing POPE accuracy by 4.1 percent (from 80.2 percent to 84.3 percent) and also reducing hallucination rates on MS-COCO. In contrast, LLaVA-1.5 and InstructBLIP experience a severe performance drop, with accuracy falling from around 84 percent to near-random levels. Through detailed analysis, we found that CLIP-based encoders in LLaVA and InstructBLIP exhibit…
Taxonomy
Topics: Multimodal Machine Learning Applications · Adversarial Robustness in Machine Learning · Face Recognition and Perception
