Waver: Wave Your Way to Lifelike Video Generation

Yifu Zhang; Hao Yang; Yuqi Zhang; Yifei Hu; Fengda Zhu; Chuang Lin; Xiaofeng Mei; Yi Jiang; Bingyue Peng; Zehuan Yuan

arXiv:2508.15761·cs.CV·August 27, 2025

TL;DR

Waver is a foundation model for unified image and video generation. It supports text-to-video, image-to-video, and text-to-image generation in a single framework, produces 5 to 10 second videos at native 720p (upscaled to 1080p), and emphasizes complex motion and temporal consistency.

Contribution

Introduces Waver, a unified model for image and video generation built on a Hybrid Stream DiT architecture, together with a data curation pipeline that uses an MLLM-based video quality model to select high-quality training samples.

Findings

1. Ranks in the top 3 on T2V and I2V leaderboards.
2. Generates 5 to 10 second videos at 720p, which are then upscaled to 1080p.
3. Outperforms existing open-source and commercial models.

Abstract

We present Waver, a high-performance foundation model for unified image and video generation. Waver can directly generate videos with durations ranging from 5 to 10 seconds at a native resolution of 720p, which are subsequently upscaled to 1080p. The model simultaneously supports text-to-video (T2V), image-to-video (I2V), and text-to-image (T2I) generation within a single, integrated framework. We introduce a Hybrid Stream DiT architecture to enhance modality alignment and accelerate training convergence. To ensure training data quality, we establish a comprehensive data curation pipeline and manually annotate and train an MLLM-based video quality model to filter for the highest-quality samples. Furthermore, we provide detailed training and inference recipes to facilitate the generation of high-quality videos. Building on these contributions, Waver excels at capturing complex motion,…
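
As a rough illustration of the quality-filtering step the abstract describes, the following Python sketch keeps only candidate clips whose quality score clears a cutoff. It is not Waver's pipeline: the `Clip` record, the `toy_score` function (a stand-in for the trained MLLM-based quality model), and the 0.8 threshold are all hypothetical and chosen only to make the example runnable.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Clip:
    """Hypothetical record for one candidate training clip."""
    path: str
    width: int
    height: int
    duration_s: float


def filter_clips(
    clips: Iterable[Clip],
    score_clip: Callable[[Clip], float],
    threshold: float = 0.8,  # assumed cutoff, not taken from the paper
) -> List[Clip]:
    """Keep clips whose quality score is at or above the threshold."""
    return [clip for clip in clips if score_clip(clip) >= threshold]


def toy_score(clip: Clip) -> float:
    """Stand-in scorer; the paper uses a trained MLLM-based model instead.

    This toy version only rewards clips that are at least 720 pixels tall.
    """
    return 1.0 if clip.height >= 720 else 0.0


candidates = [
    Clip("a.mp4", 1280, 720, 6.0),
    Clip("b.mp4", 640, 360, 8.0),
    Clip("c.mp4", 1920, 1080, 5.0),
]
kept = filter_clips(candidates, toy_score)
print([clip.path for clip in kept])
```

Passing the scorer in as a callable keeps the filtering logic independent of whichever quality model is used, so a learned scorer could replace `toy_score` without changing `filter_clips`.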
