ViBe: Ultra-High-Resolution Video Synthesis Born from Pure Images
Yunfeng Wu, Hongying Cheng, Zihao He, and Songhua Liu

TL;DR
ViBe is a framework for ultra-high-resolution video synthesis that is trained only on images. It combines a two-stage adaptation strategy with a high-frequency training objective to produce detailed videos without any video training data.
Contribution
We propose Relay LoRA, a two-stage adaptation method that lets a pre-trained video model synthesize high-resolution video using only image data, and a high-frequency training objective that improves detail recovery.
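For readers unfamiliar with the building block that Relay LoRA is named after, the following is a minimal sketch of a standard LoRA weight update, W + (alpha / rank) * B A, in plain Python. It is not the paper's Relay LoRA procedure: this summary does not specify how the adapters are relayed between the two stages, and the dimensions, rank, and `alpha` below are purely illustrative assumptions.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_weight(w, a, b, alpha, rank):
    """Effective weight of a standard LoRA layer: W + (alpha / rank) * B @ A."""
    scale = alpha / rank
    delta = matmul(b, a)
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

random.seed(0)
d_out, d_in, rank = 4, 6, 2  # toy sizes, chosen only for illustration

# Frozen pre-trained weight (d_out x d_in).
w = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
# Trainable low-rank factors: A is (rank x d_in), B is (d_out x rank).
a = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
b = [[0.0] * rank for _ in range(d_out)]  # zero init, so the update starts at zero

w_eff = lora_weight(w, a, b, alpha=2.0, rank=rank)
unchanged = (w_eff == w)           # True: with B = 0 the base model is untouched
trainable = rank * (d_in + d_out)  # parameters in A and B
full = d_in * d_out                # parameters in the frozen weight
```

Because B starts at zero, the adapted model initially behaves exactly like the pre-trained one, and only the small factors A and B receive gradients during fine-tuning.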
Findings
Outperforms state-of-the-art models on the VBench benchmark.
Generates ultra-high-resolution videos with rich details.
Does not require video training data.
Abstract
Transformer-based video diffusion models rely on 3D attention over spatial and temporal tokens, which incurs quadratic time and memory complexity and makes end-to-end training for ultra-high-resolution videos prohibitively expensive. To overcome this bottleneck, we propose a pure image adaptation framework that upgrades a video Diffusion Transformer pre-trained at its native scale to synthesize higher-resolution videos. Unfortunately, naively fine-tuning with high-resolution images alone often introduces noticeable noise due to the image-video modality gap. To address this, we decouple the learning objective to separately handle modality alignment and spatial extrapolation. At the core of our approach is Relay LoRA, a two-stage adaptation strategy. In the first stage, the video diffusion model is adapted to the image domain using low-resolution images to bridge the modality gap. In the…
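The quadratic cost of 3D attention mentioned in the abstract can be made concrete with a rough token count. The sketch below assumes a hypothetical patch size, temporal stride, and pair of resolutions; none of these values come from the paper, and they are chosen only to show how the cost scales.

```python
def attention_tokens(frames, height, width, patch=16, temporal_stride=4):
    """Count spatio-temporal tokens for a hypothetical patchified video.

    `patch` and `temporal_stride` are illustrative assumptions,
    not values reported for ViBe.
    """
    return (frames // temporal_stride) * (height // patch) * (width // patch)

def attention_pairs(n_tokens):
    """Full self-attention scores every token against every token."""
    return n_tokens * n_tokens

# Illustrative resolutions: the second is 4x larger along each spatial side.
base = attention_tokens(frames=80, height=480, width=832)
high = attention_tokens(frames=80, height=1920, width=3328)

ratio_tokens = high // base
ratio_pairs = attention_pairs(high) // attention_pairs(base)
print(ratio_tokens, ratio_pairs)  # prints "16 256"
```

Scaling each spatial side by 4 multiplies the token count by 16 and the number of attention pairs by 256, which is why training directly on ultra-high-resolution video becomes so expensive.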
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Image Processing Techniques · Image and Video Quality Assessment
