Coherent Zero-Shot Visual Instruction Generation
Quynh Phung, Songwei Ge, Jia-Bin Huang

TL;DR
This paper presents a training-free method that combines diffusion models and large language models to generate consistent, visually appealing multi-step visual instructions, addressing a key challenge in text-to-image synthesis.
Contribution
The paper introduces a training-free framework that keeps objects visually consistent and their state transitions smooth across the sequence of images generated from multi-step instructions.
Findings
Effective multi-step instruction visualization
Maintains visual consistency across steps
Outperforms baseline methods in alignment and quality
Abstract
Despite the advances in text-to-image synthesis, particularly with diffusion models, generating visual instructions that require consistent representation and smooth state transitions of objects across sequential steps remains a formidable challenge. This paper introduces a simple, training-free framework to tackle these issues, capitalizing on the advancements in diffusion models and large language models (LLMs). Our approach systematically integrates text comprehension and image generation to ensure visual instructions are visually appealing and maintain consistency and accuracy throughout the instruction sequence. We validate the effectiveness by testing multi-step instructions and comparing the text alignment and consistency with several baselines. Our experiments show that our approach can visualize coherent and visually pleasing instructions.
Taxonomy
Topics: Advanced Vision and Imaging · Image Processing Techniques and Applications · Video Analysis and Summarization
Methods: Diffusion
