DiT as Real-Time Rerenderer: Streaming Video Stylization with Autoregressive Diffusion Transformer
Hengye Lyu, Zisu Li, Yue Hong, Yueting Weng, Jiaxin Shi, Hanwang Zhang, Chen Liang

TL;DR
This paper introduces RTR-DiT, a diffusion transformer-based framework for real-time, stable, and consistent long video stylization that supports interactive style switching.
Contribution
The work presents a novel autoregressive diffusion transformer model with a reference-preserving cache strategy for real-time long video stylization.
Findings
Outperforms existing methods in quantitative metrics and visual quality.
Supports real-time long video stylization and interactive style switching.
Enables stable and consistent processing of long videos.
Abstract
Recent advances in video generation models have significantly accelerated video generation and related downstream tasks. Among these, video stylization holds important research value in areas such as immersive applications and artistic creation, attracting widespread attention. However, existing diffusion-based video stylization methods struggle to maintain stability and consistency when processing long videos, and their high computational cost and multi-step denoising make them difficult to apply in practical scenarios. In this work, we propose RTR-DiT (DiT as Real-Time Rerenderer), a streaming video stylization framework built upon the Diffusion Transformer. We first fine-tune a bidirectional teacher model on a curated video stylization dataset, supporting both text-guided and reference-guided video stylization tasks, and subsequently distill it into a few-step autoregressive model via…
