Video Virtual Try-on with Conditional Diffusion Transformer Inpainter

Cheng Zou; Senlin Cheng; Bolei Xu; Dandan Zheng; Xiaobo Li; Jingdong Chen; Ming Yang

arXiv:2506.21270·cs.CV·June 27, 2025

TL;DR

This paper introduces ViTI (Video Try-on Inpainter), a video virtual try-on method built on a diffusion transformer-based inpainting framework, designed to improve spatial-temporal consistency across frames and to preserve garment details.

Contribution

Formulates video virtual try-on as a conditional video inpainting task and implements it with a diffusion transformer that uses 3D attention, improving the consistency and garment detail of try-on videos.
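To make the contrast with temporal-attention insertion concrete, here is a minimal NumPy sketch. It assumes that "3D attention" means joint self-attention over all time, height, and width tokens; the function names are hypothetical, and learned query/key/value projections are omitted (identity projections) to keep the example short. It is an illustration of the general technique, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention; q, k, v have shape (..., N, D).
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def full_3d_attention(tokens):
    # tokens: (T, H, W, D). Every token attends to every other token
    # in the whole clip, i.e. one sequence of length T*H*W.
    T, H, W, D = tokens.shape
    seq = tokens.reshape(T * H * W, D)
    return attention(seq, seq, seq).reshape(T, H, W, D)

def temporal_only_attention(tokens):
    # tokens: (T, H, W, D). Each spatial location attends only to the
    # same location in the other frames (H*W independent sequences of length T).
    T, H, W, D = tokens.shape
    seq = tokens.transpose(1, 2, 0, 3).reshape(H * W, T, D)
    out = attention(seq, seq, seq)
    return out.reshape(H, W, T, D).transpose(2, 0, 1, 3)

rng = np.random.default_rng(0)
clip = rng.standard_normal((3, 2, 2, 4))  # toy clip: 3 frames, 2x2 tokens, dim 4
print(full_3d_attention(clip).shape, temporal_only_attention(clip).shape)
```

In the temporal-only variant a token never sees a different spatial position in another frame, whereas the joint variant lets it do so at the cost of an attention matrix whose side grows with T*H*W.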

Findings

1. Outperforms previous methods in spatial-temporal consistency.

2. Achieves high-quality garment detail preservation.

3. Demonstrates superior quantitative and qualitative results.

Abstract

Video virtual try-on aims to naturally fit a garment to a target person across consecutive video frames. The task is challenging for two reasons: the output video must be spatially and temporally consistent, and the details of the given garment must be well preserved in every frame. Naively applying image-based try-on methods frame by frame yields poor results because of severe inconsistency. The few recent diffusion-based video try-on methods converge on a similar solution: inserting temporal attention into an image-based try-on model to adapt it to the video try-on task. This has shown improvements, but inconsistency problems remain. In this paper, we propose ViTI (Video Try-on Inpainter), which, unlike previous methods, formulates and implements video virtual try-on as a conditional video inpainting task. In this way, we start with…
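As a rough illustration of what "conditional video inpainting" can look like at the input level, the sketch below masks the region to be regenerated in every frame and stacks the noisy video, the masked video, and the mask along the channel axis. This channel-concatenation layout is one common convention in inpainting diffusion models and is an assumption here; the abstract does not state ViTI's actual input layout, and the garment condition is left out entirely. The function name `build_inpainting_input` is hypothetical.

```python
import numpy as np

def build_inpainting_input(frames, mask, noisy):
    """Assemble a per-frame inpainting input (assumed layout, not the paper's).

    frames: (T, H, W, C) source video
    mask:   (T, H, W), 1 where content must be regenerated (e.g. the clothing region)
    noisy:  (T, H, W, C) current noisy sample in the diffusion process
    returns (T, H, W, 2*C + 1): [noisy | masked frames | mask]
    """
    m = mask[..., None].astype(frames.dtype)
    masked_frames = frames * (1.0 - m)  # hide the region to be repainted
    return np.concatenate([noisy, masked_frames, m], axis=-1)

T, H, W, C = 2, 4, 4, 3
frames = np.ones((T, H, W, C))
noisy = np.zeros((T, H, W, C))
mask = np.zeros((T, H, W))
mask[:, 1:3, 1:3] = 1  # toy region to regenerate
print(build_inpainting_input(frames, mask, noisy).shape)
```

The model then only has to synthesize the masked region, while the unmasked pixels (face, background, limbs) are carried through as context.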

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Computer Graphics and Visualization Techniques · Advanced Vision and Imaging

Methods: Layer Normalization · Dropout · Absolute Position Encodings · Dense Connections · Byte Pair Encoding · Softmax · Label Smoothing · Transformer · Inpainting · Diffusion