iTryOn: Mastering Interactive Video Virtual Try-On with Spatial-Semantic Guidance
Jun Zheng, Zhengze Xu, Mengting Chen, Jing Wang, Jinsong Lan, Xiaoyong Zhu, Kaifu Zhang, Bo Zheng, Xiaodan Liang

TL;DR
This paper introduces iTryOn, a novel framework for interactive video virtual try-on that incorporates spatial-semantic guidance to handle complex human-garment interactions and deformations.
Contribution
It formalizes the new task of Interactive VVT and proposes a diffusion Transformer-based model with multi-level interaction guidance and a novel A-RoPE embedding.
Findings
Achieves state-of-the-art results on traditional VVT benchmarks.
Leads in the new Interactive VVT setting.
Effectively models complex garment dynamics during interactions.
Abstract
Video Virtual Try-On (VVT) aims to seamlessly replace a garment on a person in a video with a new one. While existing methods have made significant strides in maintaining temporal consistency, they are predominantly confined to non-interactive scenarios where models merely showcase garments. This limitation overlooks a crucial aspect of real-world apparel presentation: active human-garment interaction. To bridge this gap, we introduce and formalize a new challenging task: Interactive Video Virtual Try-On (Interactive VVT), where subjects in the video actively engage with their clothing. This task introduces unique challenges beyond simple texture preservation, including: (1) resolving the semantic ambiguity of interactions from standard pose information, and (2) learning complex garment deformations from video where interactive moments are sparse and brief. To address these challenges,…
