UniVid: Pyramid Diffusion Model for High Quality Video Generation
Xinyu Xiao, Binbin Yang, Tingtian Li, Yipeng Yu, Sen Lei

TL;DR
UniVid is a unified diffusion model that generates high-quality, temporally coherent videos from text prompts and reference images, effectively combining text-to-video and image-to-video generation paradigms.
Contribution
The paper introduces UniVid, a novel model that integrates text and image controls into a single diffusion framework for improved video generation.
Findings
Achieves superior temporal coherence on T2V, I2V, and combined tasks.
Extracts object appearance and motion descriptions from the text prompt, and texture details and structural information from the reference image.
Supports flexible bimodal control during inference.
Abstract
Diffusion-based text-to-video (T2V) and image-to-video (I2V) generation have emerged as a prominent research focus. However, integrating the two generative paradigms into a unified model remains a challenge. In this paper, we present a unified video generation model (UniVid) with hybrid conditions of the text prompt and reference image. Given these two available controls, our model can extract objects' appearance and their motion descriptions from textual prompts, while obtaining texture details and structural information from image clues to guide the video generation process. Specifically, we scale up the pre-trained text-to-image diffusion model to generate temporally coherent frames by introducing our temporal-pyramid cross-frame spatial-temporal attention modules and convolutions. To support bimodal control, we introduce a dual-stream cross-attention mechanism,…
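The abstract is cut off before it describes the dual-stream cross-attention mechanism, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that frame latents attend separately to text tokens and to image tokens, and that the two results are added back with per-stream weights. The function names, the weights `w_text` and `w_image`, the residual sum, and the absence of learned projections are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context):
    # Scaled dot-product attention without learned projections (simplification).
    # queries: (n_q, d), context: (n_c, d) -> output: (n_q, d)
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ context

def dual_stream_cross_attention(latents, text_tokens, image_tokens,
                                w_text=1.0, w_image=1.0):
    # Hypothetical fusion: one attention stream per modality, combined by a
    # weighted residual sum. The weights are an assumed control knob.
    text_out = cross_attention(latents, text_tokens)
    image_out = cross_attention(latents, image_tokens)
    return latents + w_text * text_out + w_image * image_out

rng = np.random.default_rng(0)
latents = rng.normal(size=(4, 8))       # 4 latent positions, feature size 8
text_tokens = rng.normal(size=(5, 8))   # 5 text-prompt tokens
image_tokens = rng.normal(size=(3, 8))  # 3 reference-image tokens
fused = dual_stream_cross_attention(latents, text_tokens, image_tokens)
```

In this sketch, setting `w_image=0.0` reduces the layer to text-only conditioning and setting `w_text=0.0` to image-only conditioning, which is one plausible way a single model could serve T2V, I2V, and combined inputs at inference time.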
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Human Motion and Animation
