Vision-Driven Prompt Optimization for Large Language Models in Multimodal Generative Tasks

Leo Franklin; Apiradee Boonmee; Kritsada Wongsuwan

arXiv:2501.02527·cs.CV·January 7, 2025

Open Access

TL;DR

This paper introduces VDPO, a novel framework that uses large language models to generate prompts from visual inputs, significantly improving multimodal image synthesis quality across various benchmarks.

Contribution

The paper presents a new vision-driven prompt optimization framework that integrates visual and textual modules to enhance image generation performance.

Findings

01 VDPO outperforms existing methods on the COCO and Sketchy benchmarks.

02 Achieves significant improvements in FID, LPIPS, BLEU, and CIDEr scores.

03 Demonstrates robustness, scalability, and generalization in diverse tasks.

Abstract

Vision generation remains a challenging frontier in artificial intelligence, requiring seamless integration of visual understanding and generative capabilities. In this paper, we propose a novel framework, Vision-Driven Prompt Optimization (VDPO), that leverages Large Language Models (LLMs) to dynamically generate textual prompts from visual inputs, guiding high-fidelity image synthesis. VDPO combines a visual embedding prompt tuner, a textual instruction generator, and a vision generation module to achieve state-of-the-art performance in diverse vision generation tasks. Extensive experiments on benchmarks such as COCO and Sketchy demonstrate that VDPO consistently outperforms existing methods, achieving significant improvements in FID, LPIPS, and BLEU/CIDEr scores. Additional analyses reveal the scalability, robustness, and generalization capabilities of VDPO, making it a versatile…
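
The abstract names three components chained in sequence: a visual embedding prompt tuner, a textual instruction generator, and a vision generation module. The sketch below only illustrates that data flow (image to embedding to prompt to generated image). It is not the authors' implementation: the class `ThreeStagePipeline`, the `toy_*` functions, and the list-of-lists image type are all hypothetical stand-ins, and a real system would presumably plug in a vision encoder, an LLM, and an image generator in their place.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Toy stand-in for an image: rows of pixel intensities in [0, 1].
Image = List[List[float]]


@dataclass
class ThreeStagePipeline:
    """Chains the three stages named in the abstract (all callables are hypothetical)."""
    embed_image: Callable[[Image], List[float]]   # stands in for the visual embedding prompt tuner
    write_prompt: Callable[[List[float]], str]    # stands in for the textual instruction generator
    synthesize: Callable[[str], Image]            # stands in for the vision generation module

    def run(self, image: Image) -> Tuple[str, Image]:
        embedding = self.embed_image(image)       # visual input -> embedding
        prompt = self.write_prompt(embedding)     # embedding -> textual prompt
        return prompt, self.synthesize(prompt)    # prompt -> generated image


def toy_embed(image: Image) -> List[float]:
    # Assumption: mean intensity per row serves as a fake "embedding".
    return [sum(row) / len(row) for row in image]


def toy_prompt(embedding: List[float]) -> str:
    # Assumption: a threshold rule stands in for an LLM writing a prompt.
    level = "bright" if sum(embedding) / len(embedding) > 0.5 else "dark"
    return f"a {level} scene"


def toy_synthesize(prompt: str) -> Image:
    # Assumption: a constant 2x2 image stands in for a generative model.
    value = 1.0 if "bright" in prompt else 0.0
    return [[value, value], [value, value]]


pipeline = ThreeStagePipeline(toy_embed, toy_prompt, toy_synthesize)
prompt, generated = pipeline.run([[0.9, 0.8], [0.7, 1.0]])
print(prompt)
print(generated)
```

Keeping each stage behind a plain callable means any one of them can be swapped without touching the other two, which is one way to read the modular design the abstract describes.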

Taxonomy

Topics: Speech and dialogue systems · Multimodal Machine Learning Applications · Natural Language Processing Techniques