MagicTryOn: Harnessing Diffusion Transformer for Garment-Preserving Video Virtual Try-on
Guangyuan Li, Siming Zheng, Hao Zhang, Jinwei Chen, Junsheng Luan, Binkai Ou, Lei Zhao, Bo Li, Peng-Tao Jiang

TL;DR
MagicTryOn is a novel diffusion-transformer framework that significantly improves garment detail preservation and temporal consistency in video virtual try-on, enabling realistic and stable garment synthesis across video frames.
Contribution
It introduces a fine-grained garment-preservation strategy, a garment-aware spatiotemporal RoPE, and a distribution-matching distillation method for real-time inference, advancing the state of the art in garment-preserving video virtual try-on (VVT).
Findings
Outperforms existing methods in garment detail fidelity.
Achieves superior temporal stability and reduced jitter.
Enables real-time inference without loss of quality.
Abstract
Video Virtual Try-On (VVT) aims to synthesize garments that appear natural across consecutive video frames, capturing both their dynamics and interactions with human motion. Despite recent progress, existing VVT methods still suffer from inadequate garment fidelity and limited spatiotemporal consistency. The reasons are: (1) under-exploitation of garment information, with limited garment cues being injected, resulting in weaker fine-detail fidelity; and (2) a lack of spatiotemporal modeling, which hampers cross-frame identity consistency and causes temporal jitter and appearance drift. In this paper, we present MagicTryOn, a diffusion-transformer based framework for garment-preserving video virtual try-on. To preserve fine-grained garment details, we propose a fine-grained garment-preservation strategy that disentangles garment cues and injects these decomposed priors into the denoising…
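The excerpt above does not specify how the spatiotemporal RoPE is built, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a standard rotary embedding split across (frame, row, column) axes, and it assumes that garment tokens are placed on extra columns of the same grid (a hypothetical "grid extension"). All names (`rope_1d`, `rope_3d`, `W`) and the toy vectors are illustrative.

```python
import math

def rope_1d(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles pos * base**(-i/d)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def rope_3d(vec, pos_thw, base=10000.0):
    """Split `vec` into three equal chunks; rotate each by one of (t, h, w)."""
    d = len(vec) // 3
    out = []
    for axis, pos in enumerate(pos_thw):
        out += rope_1d(vec[axis * d:(axis + 1) * d], pos, base)
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Assumed layout: the person video occupies columns 0..W-1, and garment
# tokens sit on columns W..2W-1 of the same (frame, row) coordinates.
W = 4
q = [0.3, -1.2, 0.7, 0.1, 0.9, -0.4, 0.5, 0.2, -0.8, 1.1, 0.6, -0.3]
k = [1.0, 0.2, -0.5, 0.8, 0.3, 0.3, -0.7, 0.4, 0.9, -0.1, 0.2, 0.6]

video_pos = (2, 1, 3)        # (frame, row, column) of a video token
garment_pos = (2, 1, W + 0)  # garment token on the extended columns

score = dot(rope_3d(q, video_pos), rope_3d(k, garment_pos))

# Shifting both tokens by the same offset leaves the attention score
# unchanged, because rotary embeddings encode only relative position.
shift = (5, 2, 7)
shifted_score = dot(
    rope_3d(q, tuple(p + s for p, s in zip(video_pos, shift))),
    rope_3d(k, tuple(p + s for p, s in zip(garment_pos, shift))),
)
```

Under these assumptions, a video token and a garment token in the same frame and row differ only along the column axis, so full self-attention sees them as spatial neighbours on one shared grid rather than as unrelated sequences.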
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. Experiments are conducted on both image-based and video-based datasets.
2. Ablation studies are conducted to evaluate the effectiveness of each component.

1. The claim of correctly maintaining “compositional relationships” in multi-garment try-on is only supported by the qualitative Fig. 4, with no quantitative metrics (e.g., VFID-I3D, SSIM for multi-garment sequences) or statistical analysis. This leaves the performance of multi-garment handling unsubstantiated.
2. The main comparison (Table 1) excludes recent state-of-the-art methods like DreamVVT (Zuo et al., 2025) or SwiftTry (Nguyen et al., 2025), which also focus on temporal consistency. This inc

1. The paper proposes a well-structured diffusion transformer framework tailored for video virtual try-on with clear motivation and technical contributions.
2. It introduces innovative modules, including fine-grained garment-preservation and garment-aware spatiotemporal RoPE, effectively enhancing detail fidelity and temporal consistency.
3. The method achieves real-time inference through distribution-matching distillation while maintaining strong performance, supported by comprehensive experime

1. The proposed design appears rather standard, primarily relying on a combination of strong pretrained encoders and DiT blocks with cross-attention. The architectural novelty and unique algorithmic contribution seem limited.
2. The framework integrates numerous large components (VAE, T5 encoder, CLIP encoder, Qwen-7B, and Wan2.1), resulting in a highly complex system. It remains unclear which specific modules in Figure 2 are initialized with Wan2.1 pretrained weights, as mentioned in line 315.
3.

1. Innovative Spatiotemporal Encoding: The extension of RoPE to “garment-aware spatiotemporal RoPE” is principled and directly addresses temporal instability; the subsuming of garment tokens into the full self-attention with grid extension is mathematically described and justified.
2. Thorough Ablations: Table 3 and Figure 12 provide a detailed ablation of key architectural modules, with analyses that specify the performance and qualitative impact of removing each token stream or loss component,

1. Limited Novelty in Architectural Choices: While the garments’ semantic/structural/appearance decomposition and the cross-attention wiring are interesting, the overall method mainly combines existing mechanisms from prior VVT and diffusion transformer literature, such as patchification, CLIP feature usage, full self-attention, and cross-token fusion. As evident from the Related Work and as per foundational methods like ViViD, CatV2TON, or Hunyuan-DiT (all cited in the main text), many core str
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · 3D Shape Modeling and Analysis · Human Pose and Action Recognition
