ARMANI: Part-level Garment-Text Alignment for Unified Cross-Modal Fashion Design
Xujie Zhang, Yu Sha, Michael C. Kampffmeyer, Zhenyu Xie, Zequn Jie, Chengwen Huang, Jianqing Peng, Xiaodan Liang

TL;DR
ARMANI is a novel cross-modal fashion design model that uses part-level garment-text alignment and a two-stage process with a cross-modal Transformer to generate realistic fashion images from various control signals.
Contribution
It introduces MaskCLIP for fine-grained garment-text alignment and a unified two-stage framework with a cross-modal Transformer for versatile fashion image synthesis.
Findings
Outperforms existing cross-modal synthesis methods.
Generates photo-realistic fashion images from diverse control signals.
Demonstrates effectiveness on a new cross-modal fashion dataset.
Abstract
Cross-modal fashion image synthesis has emerged as one of the most promising directions in the generation domain due to the vast untapped potential of incorporating multiple modalities and the wide range of fashion image applications. To facilitate accurate generation, cross-modal synthesis methods typically rely on Contrastive Language-Image Pre-training (CLIP) to align textual and garment information. In this work, we argue that simply aligning textual and garment information is not sufficient to capture the semantics of the visual information and therefore propose MaskCLIP. MaskCLIP decomposes the garments into semantic parts, ensuring fine-grained and semantically accurate alignment between the visual and text information. Building on MaskCLIP, we propose ARMANI, a unified cross-modal fashion designer with part-level garment-text alignment. ARMANI discretizes an image into uniform…
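To make the idea of part-level alignment concrete, here is a minimal NumPy sketch of a CLIP-style symmetric contrastive loss applied to garment-part embeddings rather than whole-image embeddings. It is an illustration under stated assumptions, not the authors' MaskCLIP implementation: the function names (`masked_part_embeddings`, `clip_style_loss`), the mask-based average pooling, and the temperature value of 0.07 are all hypothetical choices made for this example.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def masked_part_embeddings(features, part_masks):
    """Average-pool per-pixel features inside each binary part mask.

    features:   (H, W, D) feature map (assumed to come from some image encoder).
    part_masks: (K, H, W) binary masks, one per semantic garment part.
    Returns a (K, D) array with one embedding per part.
    """
    masks = part_masks.astype(features.dtype)
    summed = np.einsum("khw,hwd->kd", masks, features)
    counts = masks.sum(axis=(1, 2))[:, None]
    return summed / np.maximum(counts, 1.0)  # guard against empty masks


def clip_style_loss(part_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over N matched (part, phrase) pairs.

    Row i of part_emb is assumed to correspond to row i of text_emb.
    The temperature is an assumed hyperparameter for this sketch.
    """
    p = l2_normalize(part_emb)
    t = l2_normalize(text_emb)
    logits = p @ t.T / temperature  # (N, N) scaled cosine similarities
    idx = np.arange(logits.shape[0])
    loss_part_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_text_to_part = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_part_to_text + loss_text_to_part)
```

In this sketch, each garment part (for example a sleeve or a collar) gets its own embedding, so the loss rewards matching a part to the phrase that describes it instead of matching one global image vector to the full caption.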
Taxonomy
Methods: Attention Is All You Need · Linear Layer · Dense Connections · Multi-Head Attention · Absolute Position Encodings · Label Smoothing · Position-Wise Feed-Forward Layer · Layer Normalization · Softmax · Residual Connection
