DH-VTON: Deep Text-Driven Virtual Try-On via Hybrid Attention Learning
Jiabao Wei, Zhiyuan Ma

TL;DR
DH-VTON is a novel deep learning model for virtual try-on that uses hybrid attention and deep semantic garment features to improve realism and detail preservation in synthesized images.
Contribution
The paper introduces DH-VTON, combining InternViT-6B and GFC+ modules with hybrid attention to enhance semantic garment understanding and multi-scale feature integration in virtual try-on.
Findings
Outperforms previous diffusion-based and GAN-based methods
Preserves garment details effectively
Generates more authentic human images
Abstract
Virtual Try-ON (VTON) aims to synthesize images of a specific person dressed in given garments, and has recently received considerable attention in online shopping scenarios. Currently, the core challenges of the VTON task lie mainly in fine-grained semantic extraction (i.e., deep semantics) of the given reference garments during depth estimation, and in effective texture preservation when the garments are synthesized and warped onto the human body. To cope with these issues, we propose DH-VTON, a deep text-driven virtual try-on model featuring a special hybrid attention learning strategy and a deep garment semantic preservation module. We build the DH-VTON pipeline on a well-built pre-trained paint-by-example (abbr. PBE) approach. Specifically, to extract the deep semantics of the garments, we first introduce InternViT-6B as a fine-grained feature learner, which…
Taxonomy
Topics: Topic Modeling · Text and Document Classification Technologies
Methods: Softmax · Attention Is All You Need · ALIGN · Contrastive Language-Image Pre-training
