Effectively Enhancing Vision Language Large Models by Prompt Augmentation and Caption Utilization
Minyi Zhao, Jie Wang, Zhaoyang Li, Jiyuan Zhang, Zhenbang Sun,, Shuigeng Zhou

TL;DR
This paper introduces PACU, a novel instruct-tuning framework that enhances Vision Language Large Models by automatically augmenting prompts and utilizing image captions, effectively reducing hallucinations and improving response accuracy.
Contribution
The paper proposes PACU, a new method combining prompt augmentation and caption utilization to improve VLLM performance and mitigate hallucination issues.
Findings
PACU significantly reduces hallucination in VLLMs.
Enhanced prompt diversity improves model robustness.
Utilizing image captions aids in accurate response generation.
Abstract
Recent studies have shown that Vision Language Large Models (VLLMs) may output content not relevant to the input images. This problem, called the hallucination phenomenon, undoubtedly degrades VLLM performance. Therefore, various anti-hallucination techniques have been proposed to make model output more reasonable and accurate. Despite their successes, from extensive tests we found that augmenting the prompt (e.g. word appending, rewriting, and spell error etc.) may change model output and make the output hallucinate again. To cure this drawback, we propose a new instruct-tuning framework called Prompt Augmentation and Caption Utilization (PACU) to boost VLLM's generation ability under the augmented prompt scenario. Concretely, on the one hand, PACU exploits existing LLMs to augment and evaluate diverse prompts automatically. The resulting high-quality prompts are utilized to enhance…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
