TL;DR
This paper introduces LDF-VFI, a holistic auto-regressive diffusion transformer for video frame interpolation that ensures long-range temporal coherence and generalizes to high resolutions, achieving state-of-the-art results.
Contribution
The paper presents a video-centric framework that pairs a skip-concatenate sampling strategy, which limits error accumulation in auto-regressive generation, with sparse local attention and tiled VAE encoding for efficient processing of long sequences.
Findings
Achieves state-of-the-art results on VFI benchmarks.
Ensures long-range temporal coherence in video sequences.
Generalizes to arbitrary spatial resolutions like 4K.
Abstract
Existing video frame interpolation (VFI) methods often adopt a frame-centric approach, processing videos as independent short segments (e.g., triplets), which leads to temporal inconsistencies and motion artifacts. To overcome this, we propose a holistic, video-centric paradigm named Local Diffusion Forcing for Video Frame Interpolation (LDF-VFI). Our framework is built upon an auto-regressive diffusion transformer that models the entire video sequence to ensure long-range temporal coherence. To mitigate error accumulation inherent in auto-regressive generation, we introduce a novel skip-concatenate sampling strategy that effectively maintains temporal stability. Furthermore, LDF-VFI incorporates sparse, local attention and tiled VAE encoding, a combination that not only enables efficient processing of long sequences but also allows generalization to arbitrary spatial resolutions (e.g.,…
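To make the "sparse, local attention" idea concrete, here is a minimal sketch of a banded temporal attention mask, in which each frame attends only to frames within a fixed distance. This is an illustration under stated assumptions, not the paper's implementation: the abstract does not specify the attention pattern, and the names `local_attention_mask`, `attended_pairs`, and the `window` parameter are hypothetical.

```python
def local_attention_mask(num_frames, window):
    """Build a num_frames x num_frames boolean mask (hypothetical helper).

    mask[i][j] is True when frame i may attend to frame j, i.e. when the
    two frames are at most `window` positions apart. The banded pattern
    is an assumption; the paper's actual sparsity pattern may differ.
    """
    if window < 0:
        raise ValueError("window must be non-negative")
    return [
        [abs(i - j) <= window for j in range(num_frames)]
        for i in range(num_frames)
    ]


def attended_pairs(mask):
    """Count the (query, key) pairs that the mask allows."""
    return sum(sum(row) for row in mask)


# Six frames, each attending to itself and its immediate neighbours.
mask = local_attention_mask(num_frames=6, window=1)
```

With a fixed window, the number of allowed pairs grows roughly linearly with the number of frames, whereas full attention grows quadratically. That scaling is the reason a local pattern of this kind can make long sequences tractable.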
