Training-Free Sketch-Guided Diffusion with Latent Optimization
Sandra Zhang Ding, Jiafeng Mao, Kiyoharu Aizawa

TL;DR
This paper introduces a training-free, sketch-guided image generation method that uses latent optimization and cross-attention maps to produce images closely aligned with user sketches, enhancing control and customization.
Contribution
The paper presents a novel training-free pipeline that incorporates sketch guidance into diffusion models via latent optimization and cross-attention maps, improving the structural fidelity of generated images to the input sketch.
Findings
Enhanced image accuracy with sketch guidance
Training-free approach simplifies integration
Latent optimization refines structure adherence
Abstract
Built on recent advances in diffusion models, text-to-image (T2I) generation models have demonstrated the ability to generate diverse, high-quality images. However, harnessing them for real-world content creation remains a significant challenge, particularly when users need precise control over the generated result. In this paper, we propose a training-free pipeline that extends existing text-to-image generation models to accept a sketch as an additional condition. To generate new images whose layout and structure closely resemble the input sketch, we find that these core features of a sketch can be tracked with the cross-attention maps of diffusion models. We introduce latent optimization, a method that refines the noisy latent at each intermediate step of the generation process using cross-attention maps to ensure that the generated…
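The latent-optimization idea described in the abstract can be illustrated with a deliberately tiny sketch. Everything below is an assumption made for illustration and not the authors' implementation: the "cross-attention map" is a toy stand-in (a per-pixel sigmoid of the dot product between the latent and a fixed token key) rather than a map read from a pretrained diffusion U-Net, the squared-error loss against a binary sketch mask is a hypothetical choice since the abstract does not specify the objective, and the names `attention_map`, `sketch_loss`, `latent_optimization_step`, and `demo` are invented here.

```python
import math
import random


def attention_map(latent, key):
    """Toy stand-in for a cross-attention map: sigmoid(<z_p, key>) per pixel."""
    return [
        1.0 / (1.0 + math.exp(-sum(z * k for z, k in zip(pixel, key))))
        for pixel in latent
    ]


def sketch_loss(attn, mask):
    """Hypothetical objective: mean squared error between attention and sketch mask."""
    return sum((a - m) ** 2 for a, m in zip(attn, mask)) / len(mask)


def latent_optimization_step(latent, key, mask, lr):
    """One gradient-descent update of the latent that lowers sketch_loss."""
    attn = attention_map(latent, key)
    n = len(mask)
    updated = []
    for pixel, a, m in zip(latent, attn, mask):
        # Analytic gradient of (a - m)^2 / n w.r.t. the pixel's latent vector,
        # using d sigmoid(s) / ds = a * (1 - a) and ds/dz = key.
        scale = 2.0 * (a - m) * a * (1.0 - a) / n
        updated.append([z - lr * scale * k for z, k in zip(pixel, key)])
    return updated


def demo(steps=200, lr=16.0, seed=0):
    rng = random.Random(seed)
    # 4x4 binary "sketch" (a filled square), flattened row by row.
    mask = [0, 0, 0, 0,
            0, 1, 1, 0,
            0, 1, 1, 0,
            0, 0, 0, 0]
    key = [0.5, -0.5, 0.5, -0.5]  # fixed toy token key with unit norm
    latent = [[rng.gauss(0.0, 1.0) for _ in key] for _ in mask]
    before = sketch_loss(attention_map(latent, key), mask)
    # In a real sampler this refinement would be interleaved with denoising
    # steps; here only the refinement itself is shown.
    for _ in range(steps):
        latent = latent_optimization_step(latent, key, mask, lr)
    after = sketch_loss(attention_map(latent, key), mask)
    return before, after


if __name__ == "__main__":
    before, after = demo()
    print(f"loss before: {before:.4f}, after: {after:.4f}")
```

In an actual pipeline the gradient would presumably come from backpropagating through the denoising network that produces the attention maps, rather than from a closed-form expression as in this toy.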
Taxonomy
Topics: Human Motion and Animation · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition
Methods: Diffusion
