VISTA: Triplet-Supervised Video Style Transfer with Diffusion Transformers
Yiren Song, Wangzi Yao, Haofan Wang, Mike Zheng Shou

TL;DR
VISTA introduces VISTA-1000, a synthetic triplet dataset covering 1,000 styles, and a diffusion-transformer framework for temporally consistent video style transfer that disentangles style, content, and motion.
Contribution
The paper presents VISTA-1000, a synthetic triplet dataset, and a novel diffusion-transformer model for robust, temporally consistent video style transfer.
Findings
Achieves state-of-the-art style fidelity and temporal consistency.
Outperforms existing methods in content preservation.
Robust under occlusions and complex motions.
Abstract
Video style transfer aims to render videos in a target artistic style while preserving content, structure, and motion. While image stylization has advanced rapidly, video stylization remains challenging due to temporal inconsistency. Most existing methods stylize frames or keyframes and enforce consistency via heuristic temporal propagation, which is brittle under occlusions, disocclusions, and long-term motion, leading to drift and flickering artifacts. We argue that a fundamental bottleneck lies in the lack of large-scale triplet data and a principled training paradigm that jointly models and disentangles style, content, and motion. To address this, we introduce VISTA-1000, a synthetic dataset with 1,000 styles and motion-aligned triplets of style reference, clean video, and stylized video, and propose a diffusion-transformer-based in-context video style transfer framework with a…
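To make the triplet structure in the abstract concrete, here is a minimal Python sketch of one training record. It is an illustration only, not the authors' data format: the names `StyleTriplet`, `Frame`, `is_frame_aligned`, and `make_toy_frame` are hypothetical, the style reference is assumed to be a single image, and "motion-aligned" is approximated as equal frame counts between the clean and stylized videos.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical stand-in for an image: rows of pixel intensities.
Frame = List[List[float]]


@dataclass
class StyleTriplet:
    """One (style reference, clean video, stylized video) record (illustrative)."""
    style_reference: Frame       # assumed to be a single image
    clean_video: List[Frame]     # unstylized source frames
    stylized_video: List[Frame]  # the same frames rendered in the target style

    def is_frame_aligned(self) -> bool:
        # Assumption: aligned triplets pair frame i of the clean video
        # with frame i of the stylized video, so the lengths must match.
        return (
            len(self.clean_video) > 0
            and len(self.clean_video) == len(self.stylized_video)
        )


def make_toy_frame(value: float, size: int = 2) -> Frame:
    """Build a size x size frame filled with a constant value."""
    return [[value] * size for _ in range(size)]


toy = StyleTriplet(
    style_reference=make_toy_frame(0.5),
    clean_video=[make_toy_frame(0.0) for _ in range(4)],
    stylized_video=[make_toy_frame(1.0) for _ in range(4)],
)
```

A real loader would replace the nested lists with image or video tensors; the point here is only the three-part record and the per-frame correspondence between the clean and stylized clips.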
