VITON-DiT: Learning In-the-Wild Video Try-On from Human Dance Videos via Diffusion Transformers

Jun Zheng; Fuwei Zhao; Youjiang Xu; Xin Dong; Xiaodan Liang

arXiv:2405.18326 · cs.CV · June 10, 2024

TL;DR

VITON-DiT introduces a diffusion transformer-based framework for realistic in-the-wild video try-on, capable of handling casual videos with complex poses without requiring paired training data.

Contribution

First DiT-based video try-on framework that works with unpaired dance videos, using novel training strategies and a new benchmark dataset.

Findings

01. Outperforms existing methods in spatio-temporal consistency

02. Effective on casual videos with complex human poses

03. Does not require paired training datasets

Abstract

Video try-on stands as a promising area for its tremendous real-world potential. Prior works are limited to transferring product clothing images onto person videos with simple poses and backgrounds, while underperforming on casually captured videos. Recently, Sora revealed the scalability of Diffusion Transformer (DiT) in generating lifelike videos featuring real-world scenarios. Inspired by this, we explore and propose the first DiT-based video try-on framework for practical in-the-wild applications, named VITON-DiT. Specifically, VITON-DiT consists of a garment extractor, a Spatial-Temporal denoising DiT, and an identity preservation ControlNet. To faithfully recover the clothing details, the extracted garment features are fused with the self-attention outputs of the denoising DiT and the ControlNet. We also introduce novel random selection strategies during training and an…
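The abstract says the extracted garment features are fused with the self-attention outputs of the denoising DiT and the ControlNet, but it does not say which fusion operator is used. The sketch below is only an illustration under a stated assumption: it uses token-wise addition, a toy single-head attention with identity projections, and hypothetical names (`self_attention`, `fuse_additive`, `scale`). None of these come from the paper, and the actual method may differ.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Toy single-head self-attention with identity Q/K/V projections.

    A real DiT block uses learned projections and multiple heads;
    this stand-in only produces an output of the right shape.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens))
                    for j in range(d)])
    return out

def fuse_additive(attn_out, garment_feats, scale=1.0):
    """ASSUMPTION: fuse by adding garment features token-wise.

    The abstract only says the features are "fused"; addition is one
    plausible choice, not a confirmed detail of VITON-DiT.
    """
    return [[a + scale * g for a, g in zip(a_row, g_row)]
            for a_row, g_row in zip(attn_out, garment_feats)]

# Three video tokens and three garment tokens, each of dimension 2.
video_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
garment_feats = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]

attn_out = self_attention(video_tokens)
fused = fuse_additive(attn_out, garment_feats)
```

In this sketch the same `fuse_additive` call would be applied once to the denoising branch and once to the ControlNet branch. Setting `scale=0.0` recovers the unfused attention output, which makes it easy to check how much the garment branch contributes.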

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Diversity and Impact of Dance · Music and Audio Processing

Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Residual Connection · Position-Wise Feed-Forward Layer · Multi-Head Attention · Dropout · Dense Connections