ViBe: Ultra-High-Resolution Video Synthesis Born from Pure Images

Yunfeng Wu; Hongying Cheng; Zihao He; and Songhua Liu

arXiv:2603.23326·cs.CV·March 25, 2026

Open Access

TL;DR

This paper introduces ViBe, a framework that adapts a pre-trained video Diffusion Transformer to ultra-high-resolution video synthesis using only images as training data. It combines a two-stage adaptation strategy (Relay LoRA) with a high-frequency training objective to produce detailed videos without video training data.

Contribution

We propose Relay LoRA, a two-stage adaptation method that uses images alone to enable higher-resolution video synthesis, and a high-frequency training objective that improves detail recovery.
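
For readers unfamiliar with the building block, the following is a minimal sketch of a generic LoRA update, in which a frozen weight matrix W is adjusted by a low-rank product scaled by alpha / r. It illustrates standard LoRA only, not the Relay LoRA procedure itself, whose details are not given on this page. The names `matmul`, `lora_weight`, and `alpha`, and the tiny matrices, are assumptions chosen for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w, a, b, alpha):
    """Effective weight of a LoRA-adapted layer: W + (alpha / r) * B @ A.

    w: frozen base weight, shape (d_out, d_in)
    a: trainable down-projection, shape (r, d_in)
    b: trainable up-projection, shape (d_out, r)
    """
    r = len(a)              # rank is the number of rows of A
    scale = alpha / r
    delta = matmul(b, a)    # low-rank update, shape (d_out, d_in)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Hypothetical 2x2 layer with a rank-1 adapter.
w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 2.0]]

# With B initialised to zero (the usual LoRA initialisation), the layer is unchanged.
unchanged = lora_weight(w, a, [[0.0], [0.0]], alpha=2.0)

# After training, a non-zero B shifts the weight along a rank-1 direction.
adapted = lora_weight(w, a, [[1.0], [0.0]], alpha=2.0)
```

Because only A and B are trained, an adapter of this kind adds r * (d_in + d_out) parameters per layer rather than d_in * d_out, which is what makes staged adaptation of a large pre-trained model affordable.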

Findings

01 Outperforms state-of-the-art models on the VBench benchmark.

02 Generates ultra-high-resolution videos with rich details.

03 Does not require video training data.

Abstract

Transformer-based video diffusion models rely on 3D attention over spatial and temporal tokens, which incurs quadratic time and memory complexity and makes end-to-end training for ultra-high-resolution videos prohibitively expensive. To overcome this bottleneck, we propose a pure image adaptation framework that upgrades a video Diffusion Transformer pre-trained at its native scale to synthesize higher-resolution videos. Unfortunately, naively fine-tuning with high-resolution images alone often introduces noticeable noise due to the image-video modality gap. To address this, we decouple the learning objective to separately handle modality alignment and spatial extrapolation. At the core of our approach is Relay LoRA, a two-stage adaptation strategy. In the first stage, the video diffusion model is adapted to the image domain using low-resolution images to bridge the modality gap. In the…
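
The quadratic cost of 3D attention mentioned in the abstract can be made concrete with a little arithmetic. The sketch below counts tokens for a video and the number of query-key pairs that full self-attention compares. The frame count, resolutions, and compression strides (`t_stride`, `s_stride`) are hypothetical values chosen for illustration and are not taken from the paper.

```python
def token_count(frames, height, width, t_stride=4, s_stride=16):
    """Tokens seen by the transformer, assuming hypothetical temporal and
    spatial downsampling factors between pixels and tokens."""
    return (frames // t_stride) * (height // s_stride) * (width // s_stride)


def attention_pairs(n_tokens):
    """Full self-attention scores every token against every token."""
    return n_tokens * n_tokens


# Hypothetical native scale versus a 4x-per-side upscale of the same clip.
base_tokens = token_count(frames=80, height=480, width=832)
high_tokens = token_count(frames=80, height=1920, width=3328)

token_ratio = high_tokens / base_tokens
pair_ratio = attention_pairs(high_tokens) / attention_pairs(base_tokens)

print(f"tokens: {base_tokens} -> {high_tokens} ({token_ratio:.0f}x)")
print(f"attention pairs grow by {pair_ratio:.0f}x")
```

Under these assumptions, scaling each spatial side by 4 multiplies the token count by 16 and the attention pair count by 256, which is the scaling that makes end-to-end training at ultra-high resolution so expensive.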

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Advanced Image Processing Techniques · Image and Video Quality Assessment