Multi-Level Conditioning by Pairing Localized Text and Sketch for Fashion Image Generation

Ziyue Liu; Davide Talon; Federico Girella; Zanxi Ruan; Mattia Mondo; Loris Bazzani; Yiming Wang; Marco Cristani

arXiv:2602.18309·cs.CV·February 23, 2026

TL;DR

This paper introduces LOTS, a novel framework that combines global sketches with multiple localized text-sketch pairs to improve fashion image generation, validated on a new dataset called Sketchy.

Contribution

The paper proposes LOTS, a multi-level conditioning framework that effectively integrates local and global guidance for fashion image synthesis, along with a new dataset, Sketchy.

Findings

01 Improves adherence to global structure in generated images

02 Leverages multiple localized semantic cues for detailed synthesis

03 Outperforms state-of-the-art methods in fashion image generation

Abstract

Sketches offer designers a concise yet expressive medium for early-stage fashion ideation by specifying structure, silhouette, and spatial relationships, while textual descriptions complement sketches to convey material, color, and stylistic details. Effectively combining textual and visual modalities requires adherence to the sketch visual structure when leveraging the guidance of localized attributes from text. We present LOcalized Text and Sketch with multi-level guidance (LOTS), a framework that enhances fashion image generation by combining global sketch guidance with multiple localized sketch-text pairs. LOTS employs a Multi-level Conditioning Stage to independently encode local features within a shared latent space while maintaining global structural coordination. Then, the Diffusion Pair Guidance stage integrates both local and global conditioning via attention-based guidance…
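
The abstract describes attention-based guidance that combines several localized sketch-text pairs with a global sketch. The toy sketch below illustrates only that general idea, in plain Python: latent tokens attend over a context made of one embedding per local pair plus one global sketch embedding. Everything in it is an assumption made for illustration and is not taken from the paper: the names `attend` and `pair_guidance`, the vector sizes, the absence of learned projections, and the residual update are all hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, context):
    """Scaled dot-product attention of one query over context vectors.

    Hypothetical simplification: no learned query/key/value projections.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, c) / scale for c in context])
    mixed = [
        sum(w * c[i] for w, c in zip(weights, context))
        for i in range(len(query))
    ]
    return mixed, weights


def pair_guidance(latent_tokens, pair_tokens, global_token):
    """Let each latent token attend to K local pair embeddings plus one
    global sketch embedding, then add the result back residually.

    Assumed shapes: latent_tokens is N x d, pair_tokens is K x d,
    global_token has length d.
    """
    context = list(pair_tokens) + [global_token]
    updated, all_weights = [], []
    for q in latent_tokens:
        mixed, weights = attend(q, context)
        updated.append([a + b for a, b in zip(q, mixed)])
        all_weights.append(weights)
    return updated, all_weights


def rand_vec(d):
    return [random.gauss(0.0, 1.0) for _ in range(d)]


random.seed(0)
latent = [rand_vec(8) for _ in range(16)]   # 16 latent tokens, d = 8
pairs = [rand_vec(8) for _ in range(3)]     # 3 localized sketch-text pairs
global_sketch = rand_vec(8)                 # 1 global sketch embedding
updated, weights = pair_guidance(latent, pairs, global_sketch)
```

Each row of `weights` has one entry per local pair plus one for the global sketch and sums to 1, so it can be read as how strongly a latent token relies on each conditioning source in this toy setup.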

Code & Models

Models

Hugging Face: zyyyy/lots-extension (model)

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · 3D Shape Modeling and Analysis · Face recognition and analysis