Beyond Boundary Frames: Context-Centric Video Interpolation with Audio-Visual Semantics
Yuchen Deng, Xiuyang Wu, Hai-Tao Zheng, Jie Wang, Feidiao Yang, Yuxing Han

TL;DR
This paper introduces BBF (Beyond Boundary Frames), a context-centric video frame interpolation framework that uses multimodal signals such as audio and text to improve controllability and accuracy in complex motion scenarios.
Contribution
It reformulates video frame interpolation as a context-centric generation task and proposes a multi-stream integration mechanism with progressive training for enhanced performance.
Findings
BBF outperforms state-of-the-art methods on generic and audio-visual tasks.
The framework effectively integrates multimodal signals for better interpolation.
Extensive experiments validate the superiority of the proposed approach.
Abstract
Video frame interpolation (VFI) has long been challenged by limited controllability and interactivity, especially in scenarios involving fast, highly non-linear, and fine-grained motion. Although recent interactive interpolation methods have made progress, they remain largely boundary-centric and ignore auxiliary contextual signals beyond the start and end frames, leading to outputs that deviate from user-intended objectives. To address this issue, we reformulate VFI from a boundary-centric task into a context-centric generation problem. Based on this, we propose BBF (Beyond Boundary Frames), a context-centric video frame interpolation framework with decoupled multimodal conditioning, which jointly exploits endpoint-adjacent visual context, text semantics, and audio-correlated temporal dynamics. To balance endpoint consistency with context-dependent temporal evolution, BBF further introduces…
