Towards Holistic Modeling for Video Frame Interpolation with Auto-regressive Diffusion Transformers

Xinyu Peng; Han Li; Yuyang Huang; Ziyang Zheng; Yaoming Wang; Xin Chen; Wenrui Dai; Chenglin Li; Junni Zou; and Hongkai Xiong

arXiv:2601.14959·cs.CV·March 31, 2026

TL;DR

This paper introduces LDF-VFI, a holistic auto-regressive diffusion transformer for video frame interpolation that ensures long-range temporal coherence and generalizes to high resolutions, achieving state-of-the-art results.

Contribution

It presents a novel video-centric framework with a skip-concatenate sampling strategy and efficient long-sequence processing, advancing VFI performance and stability.

Findings

01. Achieves state-of-the-art results on VFI benchmarks.

02. Ensures long-range temporal coherence in video sequences.

03. Generalizes to arbitrary spatial resolutions like 4K.

Abstract

Existing video frame interpolation (VFI) methods often adopt a frame-centric approach, processing videos as independent short segments (e.g., triplets), which leads to temporal inconsistencies and motion artifacts. To overcome this, we propose a holistic, video-centric paradigm named Local Diffusion Forcing for Video Frame Interpolation (LDF-VFI). Our framework is built upon an auto-regressive diffusion transformer that models the entire video sequence to ensure long-range temporal coherence. To mitigate error accumulation inherent in auto-regressive generation, we introduce a novel skip-concatenate sampling strategy that effectively maintains temporal stability. Furthermore, LDF-VFI incorporates sparse, local attention and tiled VAE encoding, a combination that not only enables efficient processing of long sequences but also allows generalization to arbitrary spatial resolutions (e.g.,…
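The abstract mentions sparse, local attention as one ingredient for handling long sequences. As a rough illustration only, the sketch below builds a frame-level banded attention mask in which each frame attends to neighbours within a fixed window. The function names, the symmetric window, and the frame-level granularity are all assumptions made for this example; the abstract does not specify the attention pattern LDF-VFI actually uses.

```python
def local_attention_mask(num_frames: int, window: int) -> list[list[bool]]:
    """Return mask[i][j] == True when frame i may attend to frame j.

    Hypothetical helper: a symmetric band of half-width `window` around
    the diagonal. This is a generic local-attention pattern, not the
    paper's confirmed design.
    """
    if num_frames < 0 or window < 0:
        raise ValueError("num_frames and window must be non-negative")
    return [
        [abs(i - j) <= window for j in range(num_frames)]
        for i in range(num_frames)
    ]


def attended_pairs(mask: list[list[bool]]) -> int:
    """Count the (query, key) frame pairs the mask allows."""
    return sum(sum(row) for row in mask)


# Visualise the band for a short sequence: 'x' = allowed, '.' = masked out.
mask = local_attention_mask(num_frames=6, window=1)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

With a fixed window, the number of allowed pairs grows linearly with the number of frames rather than quadratically, which is the general reason local attention makes long sequences cheaper to process.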

Code & Models

Repositories

xypeng9903/LDF-VFI (GitHub)

Models

onecat-ai/LDF-VFI (Hugging Face model, 3 likes)
