ARMANI: Part-level Garment-Text Alignment for Unified Cross-Modal Fashion Design

Xujie Zhang; Yu Sha; Michael C. Kampffmeyer; Zhenyu Xie; Zequn Jie; Chengwen Huang; Jianqing Peng; Xiaodan Liang

arXiv:2208.05621·cs.CV·August 12, 2022

TL;DR

ARMANI is a novel cross-modal fashion design model that uses part-level garment-text alignment and a two-stage process with a cross-modal Transformer to generate realistic fashion images from various control signals.

Contribution

It introduces MaskCLIP for fine-grained garment-text alignment and a unified two-stage framework with a cross-modal Transformer for versatile fashion image synthesis.

Findings

01. Outperforms existing cross-modal synthesis methods.

02. Generates photo-realistic fashion images from diverse control signals.

03. Demonstrates effectiveness on a new cross-modal fashion dataset.

Abstract

Cross-modal fashion image synthesis has emerged as one of the most promising directions in the generation domain due to the vast untapped potential of incorporating multiple modalities and the wide range of fashion image applications. To facilitate accurate generation, cross-modal synthesis methods typically rely on Contrastive Language-Image Pre-training (CLIP) to align textual and garment information. In this work, we argue that simply aligning textual and garment information is not sufficient to capture the semantics of the visual information and therefore propose MaskCLIP. MaskCLIP decomposes the garments into semantic parts, ensuring fine-grained and semantically accurate alignment between the visual and text information. Building on MaskCLIP, we propose ARMANI, a unified cross-modal fashion designer with part-level garment-text alignment. ARMANI discretizes an image into uniform…
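To make the idea of part-level garment-text alignment concrete, here is a minimal sketch of a generic CLIP-style symmetric contrastive loss applied to per-part features. It is not the paper's MaskCLIP implementation: the abstract does not specify the loss, the pooling, or the encoders, so the mask-average pooling, the temperature value, and every function name below (`masked_part_features`, `symmetric_contrastive_loss`, and the helpers) are assumptions chosen for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def masked_part_features(patch_feats, part_masks):
    """Average patch features inside each part mask (assumed pooling).

    patch_feats: (P, D) features for P image patches.
    part_masks:  (K, P) binary masks, one row per garment part.
    Returns (K, D), one feature vector per part.
    """
    counts = np.maximum(part_masks.sum(axis=1, keepdims=True), 1.0)
    return (part_masks / counts) @ patch_feats


def symmetric_contrastive_loss(part_feats, text_feats, temperature=0.07):
    """CLIP-style loss: row i of part_feats should match row i of text_feats."""
    v = l2_normalize(part_feats)
    t = l2_normalize(text_feats)
    logits = v @ t.T / temperature          # (K, K) similarity matrix
    idx = np.arange(len(v))
    loss_v2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # part -> text
    loss_t2v = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> part
    return 0.5 * (loss_v2t + loss_t2v)


# Toy example: 4 patches, 2 parts (e.g. "sleeve" and "collar"), 2-d features.
patch_feats = np.arange(8, dtype=float).reshape(4, 2)
part_masks = np.array([[1, 1, 0, 0],
                       [0, 0, 1, 1]], dtype=float)
pooled = masked_part_features(patch_feats, part_masks)

# Matched part/text pairs should give a lower loss than mismatched pairs.
aligned_loss = symmetric_contrastive_loss(np.eye(4), np.eye(4))
mismatched_loss = symmetric_contrastive_loss(np.eye(4), np.roll(np.eye(4), 1, axis=0))
```

In this sketch each garment part gets its own feature vector, so the text phrase describing a part is pulled toward that part alone and not toward the whole garment. That is the intuition behind fine-grained alignment, shown here under the stated assumptions.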

Taxonomy

Methods: Attention Is All You Need · Linear Layer · Dense Connections · Multi-Head Attention · Absolute Position Encodings · Label Smoothing · Position-Wise Feed-Forward Layer · Layer Normalization · Softmax · Residual Connection