iTryOn: Mastering Interactive Video Virtual Try-On with Spatial-Semantic Guidance

Jun Zheng; Zhengze Xu; Mengting Chen; Jing Wang; Jinsong Lan; Xiaoyong Zhu; Kaifu Zhang; Bo Zheng; Xiaodan Liang

arXiv:2605.21431 · cs.CV · May 21, 2026

TL;DR

This paper introduces iTryOn, a novel framework for interactive video virtual try-on that incorporates spatial-semantic guidance to handle complex human-garment interactions and deformations.

Contribution

The paper formalizes the new task of Interactive VVT and proposes a diffusion Transformer-based model with multi-level interaction guidance and a novel A-RoPE embedding.

Findings

01. Achieves state-of-the-art results on traditional VVT benchmarks.

02. Leads in the new Interactive VVT setting.

03. Effectively models complex garment dynamics during interactions.

Abstract

Video Virtual Try-On (VVT) aims to seamlessly replace a garment on a person in a video with a new one. While existing methods have made significant strides in maintaining temporal consistency, they are predominantly confined to non-interactive scenarios where models merely showcase garments. This limitation overlooks a crucial aspect of real-world apparel presentation: active human-garment interaction. To bridge this gap, we introduce and formalize a new challenging task: Interactive Video Virtual Try-On (Interactive VVT), where subjects in the video actively engage with their clothing. This task introduces unique challenges beyond simple texture preservation, including: (1) resolving the semantic ambiguity of interactions from standard pose information, and (2) learning complex garment deformations from video where interactive moments are sparse and brief. To address these challenges,…
