MADiff: Text-Guided Fashion Image Editing with Mask Prediction and Attention-Enhanced Diffusion

Zechao Zhan; Dehong Gao; Jinxia Zhang; Jiale Huang; Yang Hu; Xin Wang

arXiv:2412.20062·cs.CV·January 16, 2025

Open Access

TL;DR

MADiff introduces a novel approach for fashion image editing that combines mask prediction and attention-enhanced diffusion to improve localization and editing strength, addressing limitations of existing text-guided models in the fashion domain.

Contribution

The paper proposes MADiff, a new model with MaskNet for accurate region localization and an Attention-Enhanced Diffusion Model for stronger editing, tailored for fashion image editing.

Findings

01

Accurately predicts editing regions in fashion images.

02

Significantly improves editing magnitude over state-of-the-art methods.

03

Constructed the Fashion-E dataset for benchmarking fashion image editing.

Abstract

Text-guided image editing models have achieved great success in the general domain. However, directly applying these models to the fashion domain may encounter two issues: (1) inaccurate localization of the editing region; (2) weak editing magnitude. To address these issues, the MADiff model is proposed. Specifically, to identify the editing region more accurately, MaskNet is proposed, in which the foreground region, DensePose, and mask prompts from a large language model are fed into a lightweight UNet to predict the mask for the editing region. To strengthen the editing magnitude, the Attention-Enhanced Diffusion Model is proposed, where the noise map, the attention map, and the mask from MaskNet are fed into the proposed Attention Processor to produce a refined noise map. By integrating the refined noise map into the diffusion model, the edited image can better align with the target prompt. Given the…
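
The abstract says the Attention Processor takes a noise map, an attention map, and the MaskNet mask and returns a refined noise map, but it does not give the formula. The sketch below only illustrates that data flow under loudly stated assumptions: the function names `normalize` and `refine_noise`, the `strength` parameter, and the multiplicative gain rule are all hypothetical and are not taken from the paper.

```python
def normalize(attention):
    """Min-max scale a 2D attention map (list of rows) to [0, 1]."""
    flat = [v for row in attention for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A constant map carries no spatial preference; treat it as all zeros.
        return [[0.0 for _ in row] for row in attention]
    return [[(v - lo) / (hi - lo) for v in row] for row in attention]


def refine_noise(noise, attention, mask, strength=0.5):
    """Hypothetical refinement rule (an assumption, not the paper's method).

    noise:     list of channels, each an H x W list of floats
    attention: H x W list of non-negative floats
    mask:      H x W list of values in [0, 1]; 1 marks the editing region
    Returns noise scaled by a per-pixel gain of 1 + strength * mask * attention.
    """
    attn = normalize(attention)
    # Gain is exactly 1 outside the mask, so unedited regions are untouched.
    gain = [
        [1.0 + strength * m * a for m, a in zip(mask_row, attn_row)]
        for mask_row, attn_row in zip(mask, attn)
    ]
    return [
        [[n * g for n, g in zip(noise_row, gain_row)]
         for noise_row, gain_row in zip(channel, gain)]
        for channel in noise
    ]


if __name__ == "__main__":
    noise = [[[2.0, 2.0], [2.0, 2.0]]]      # one channel, 2 x 2
    attention = [[0.0, 1.0], [2.0, 4.0]]
    mask = [[1, 1], [0, 1]]                 # bottom-left pixel is outside the edit region
    print(refine_noise(noise, attention, mask))
```

In this toy rule the mask confines the change to the predicted editing region, and the attention map decides how strongly each pixel inside that region is amplified. The actual Attention Processor may combine these inputs differently.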

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · 3D Shape Modeling and Analysis · Face recognition and analysis

Methods: Softmax · Attention Is All You Need · Diffusion · ALIGN