Effectively Enhancing Vision Language Large Models by Prompt Augmentation and Caption Utilization

Minyi Zhao; Jie Wang; Zhaoyang Li; Jiyuan Zhang; Zhenbang Sun; Shuigeng Zhou

arXiv:2409.14484·cs.CV·September 24, 2024

TL;DR

This paper introduces PACU, a novel instruct-tuning framework that enhances Vision Language Large Models by automatically augmenting prompts and utilizing image captions, effectively reducing hallucinations and improving response accuracy.

Contribution

The paper proposes PACU, a new method combining prompt augmentation and caption utilization to improve VLLM performance and mitigate hallucination issues.

Findings

01 PACU significantly reduces hallucination in VLLMs.

02 Enhanced prompt diversity improves model robustness.

03 Utilizing image captions aids in accurate response generation.

Abstract

Recent studies have shown that Vision Language Large Models (VLLMs) may output content that is not relevant to the input images. This problem, called the hallucination phenomenon, degrades VLLM performance. Various anti-hallucination techniques have therefore been proposed to make model output more reasonable and accurate. Despite their successes, our extensive tests show that augmenting the prompt (e.g., appending words, rewriting, or introducing spelling errors) can change the model output and cause it to hallucinate again. To address this drawback, we propose a new instruct-tuning framework called Prompt Augmentation and Caption Utilization (PACU) to boost VLLMs' generation ability under the augmented prompt scenario. Concretely, on the one hand, PACU exploits existing LLMs to augment and evaluate diverse prompts automatically. The resulting high-quality prompts are utilized to enhance…
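
To make the prompt augmentations named in the abstract concrete, here is a minimal Python sketch of two of them, word appending and spelling errors. It is an illustration under stated assumptions, not the PACU pipeline: the abstract says PACU uses existing LLMs to augment and evaluate prompts, whereas this sketch uses simple rules. The function names (`append_words`, `inject_spelling_error`, `augment`), the default suffix, and the example prompt are all hypothetical. LLM-based rewriting is omitted because it would require a model call.

```python
import random


def append_words(prompt: str, suffix: str = "Please answer briefly.") -> str:
    """Word appending: add extra words after the original prompt.

    The default suffix is a made-up example, not taken from the paper.
    """
    return f"{prompt} {suffix}"


def inject_spelling_error(prompt: str, rng: random.Random) -> str:
    """Spelling error: swap two adjacent inner letters of one randomly chosen word."""
    words = prompt.split()
    # Only purely alphabetic words with at least 4 letters have two inner letters to swap.
    candidates = [i for i, w in enumerate(words) if len(w) >= 4 and w.isalpha()]
    if not candidates:
        return prompt
    i = rng.choice(candidates)
    w = words[i]
    j = rng.randrange(1, len(w) - 2)  # first and last letters stay in place
    words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)


def augment(prompt: str, seed: int = 0) -> list[str]:
    """Return rule-based variants of a prompt (hypothetical helper)."""
    rng = random.Random(seed)  # seeded so the variants are reproducible
    return [append_words(prompt), inject_spelling_error(prompt, rng)]


if __name__ == "__main__":
    for variant in augment("What color is the umbrella in this image?"):
        print(variant)
```

Each variant could then be sent to a VLLM together with the same image to check whether the answer stays consistent with the answer to the original prompt. If the two swapped letters happen to be identical, the spelling-error variant equals the input.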

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization