Waver: Wave Your Way to Lifelike Video Generation
Yifu Zhang, Hao Yang, Yuqi Zhang, Yifei Hu, Fengda Zhu, Chuang Lin, Xiaofeng Mei, Yi Jiang, Bingyue Peng, Zehuan Yuan

TL;DR
Waver is a versatile foundation model for high-quality image and video generation from text and image inputs, with strong motion quality and temporal consistency.
Contribution
Introduces Waver, a unified model for image and video generation, built on a novel Hybrid Stream DiT architecture and a comprehensive data curation pipeline for high-quality video synthesis.
Findings
Achieves a top-3 ranking on T2V and I2V leaderboards.
Generates 5- to 10-second videos at a native resolution of 720p, then upscales them to 1080p.
Outperforms existing open-source and commercial models.
Abstract
We present Waver, a high-performance foundation model for unified image and video generation. Waver can directly generate videos with durations ranging from 5 to 10 seconds at a native resolution of 720p, which are subsequently upscaled to 1080p. The model simultaneously supports text-to-video (T2V), image-to-video (I2V), and text-to-image (T2I) generation within a single, integrated framework. We introduce a Hybrid Stream DiT architecture to enhance modality alignment and accelerate training convergence. To ensure training data quality, we establish a comprehensive data curation pipeline and manually annotate and train an MLLM-based video quality model to filter for the highest-quality samples. Furthermore, we provide detailed training and inference recipes to facilitate the generation of high-quality videos. Building on these contributions, Waver excels at capturing complex motion,…
