TL;DR
DiffST is an efficient spatiotemporal-aware diffusion framework for real-world space-time video super-resolution (STVSR) that speeds up inference and makes fuller use of spatiotemporal information.
Contribution
The paper proposes DiffST, a novel diffusion-based model with one-step sampling and cross-frame context aggregation for enhanced efficiency and spatiotemporal modeling in STVSR.
Findings
DiffST achieves state-of-the-art results on real-world STVSR tasks.
It runs about 17 times faster than previous diffusion-based methods.
Extensive experiments validate the effectiveness of the cross-frame context aggregation (CFCA) and video representation guidance (VRG) modules.
Abstract
Diffusion-based models have shown strong performance in video super-resolution (VSR) and video frame interpolation (VFI). However, their role in the coupled space-time video super-resolution (STVSR) setting remains limited. Existing diffusion-based STVSR approaches suffer from two issues: (1) low inference efficiency and (2) insufficient utilization of spatiotemporal information. These limitations impede deployment. To address these issues, we introduce DiffST, an efficient spatiotemporal-aware video diffusion framework for real-world STVSR. To improve efficiency, we adapt a pre-trained diffusion model for one-step sampling and process the entire video directly rather than operating on individual frames. Furthermore, to enhance spatiotemporal information utilization, we introduce cross-frame context aggregation (CFCA) and video representation guidance (VRG). The CFCA module aggregates…
