Coherent Zero-Shot Visual Instruction Generation

Quynh Phung; Songwei Ge; Jia-Bin Huang

arXiv:2406.04337 · cs.CV · June 11, 2024

Open Access

TL;DR

This paper presents a training-free method that combines diffusion models and large language models to generate consistent, visually appealing multi-step visual instructions, addressing a key challenge in text-to-image synthesis.

Contribution

It introduces a novel, training-free framework that ensures visual consistency and smooth transitions in sequential image generation from instructions.

Findings

01 Effective multi-step instruction visualization
02 Maintains visual consistency across steps
03 Outperforms baseline methods in alignment and quality

Abstract

Despite the advances in text-to-image synthesis, particularly with diffusion models, generating visual instructions that require consistent representation and smooth state transitions of objects across sequential steps remains a formidable challenge. This paper introduces a simple, training-free framework to tackle the issues, capitalizing on the advancements in diffusion models and large language models (LLMs). Our approach systematically integrates text comprehension and image generation to ensure visual instructions are visually appealing and maintain consistency and accuracy throughout the instruction sequence. We validate the effectiveness by testing multi-step instructions and comparing the text alignment and consistency with several baselines. Our experiments show that our approach can visualize coherent and visually pleasing instructions.

Taxonomy

Topics: Advanced Vision and Imaging · Image Processing Techniques and Applications · Video Analysis and Summarization

Methods: Diffusion