Vision-Driven Prompt Optimization for Large Language Models in Multimodal Generative Tasks
Leo Franklin, Apiradee Boonmee, Kritsada Wongsuwan

TL;DR
This paper introduces VDPO, a framework that uses large language models to generate textual prompts from visual inputs, improving image synthesis quality on benchmarks such as COCO and Sketchy.
Contribution
The paper presents a new vision-driven prompt optimization framework that integrates visual and textual modules to enhance image generation performance.
Findings
VDPO outperforms existing methods on COCO and Sketchy benchmarks.
Achieves significant improvements in FID, LPIPS, BLEU, and CIDEr scores.
Demonstrates robustness, scalability, and generalization in diverse tasks.
Abstract
Vision generation remains a challenging frontier in artificial intelligence, requiring seamless integration of visual understanding and generative capabilities. In this paper, we propose a novel framework, Vision-Driven Prompt Optimization (VDPO), that leverages Large Language Models (LLMs) to dynamically generate textual prompts from visual inputs, guiding high-fidelity image synthesis. VDPO combines a visual embedding prompt tuner, a textual instruction generator, and a vision generation module to achieve state-of-the-art performance in diverse vision generation tasks. Extensive experiments on benchmarks such as COCO and Sketchy demonstrate that VDPO consistently outperforms existing methods, achieving significant improvements in FID, LPIPS, and BLEU/CIDEr scores. Additional analyses reveal the scalability, robustness, and generalization capabilities of VDPO, making it a versatile…
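The three components named in the abstract can be read as a simple pipeline: image to embedding, embedding to textual prompt, prompt to generated image. The following is a minimal sketch of that data flow only, not the authors' implementation. The class name `VDPOPipeline`, the stage signatures, and the toy stand-in functions are all hypothetical; the abstract does not specify interfaces, model choices, or training details.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Image = List[List[float]]  # toy stand-in for an image: rows of pixel intensities
Embedding = List[float]


@dataclass
class VDPOPipeline:
    """Hypothetical wiring of the three components listed in the abstract."""
    prompt_tuner: Callable[[Image], Embedding]         # visual embedding prompt tuner
    instruction_generator: Callable[[Embedding], str]  # LLM-based textual instruction generator
    generator: Callable[[str], Image]                  # vision generation module

    def run(self, image: Image) -> Tuple[str, Image]:
        embedding = self.prompt_tuner(image)            # visual input -> embedding
        prompt = self.instruction_generator(embedding)  # embedding -> textual prompt
        return prompt, self.generator(prompt)           # prompt -> synthesized image


# Toy stand-ins (assumptions for illustration) so the sketch runs end to end.
def mean_rows(image: Image) -> Embedding:
    # One number per row: the average intensity of that row.
    return [sum(row) / len(row) for row in image]


def describe(embedding: Embedding) -> str:
    # A real system would call an LLM here; this just thresholds brightness.
    level = "bright" if sum(embedding) / len(embedding) > 0.5 else "dark"
    return f"a {level} scene"


def render(prompt: str) -> Image:
    # A real system would call an image generator; this returns a flat 2x2 image.
    value = 1.0 if "bright" in prompt else 0.0
    return [[value, value], [value, value]]


pipeline = VDPOPipeline(mean_rows, describe, render)
prompt, output = pipeline.run([[0.9, 0.8], [0.7, 0.6]])
print(prompt)
```

Keeping each stage behind a plain callable means any one of them could be swapped for a real encoder, LLM, or image generator without changing the surrounding code.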
Taxonomy
Topics: Speech and dialogue systems · Multimodal Machine Learning Applications · Natural Language Processing Techniques
