HARIVO: Harnessing Text-to-Image Models for Video Generation
Mingi Kwon, Seoung Wug Oh, Yang Zhou, Difan Liu, Joon-Young Lee, Haoran Cai, Baqiao Liu, Feng Liu, Youngjung Uh

TL;DR
HARIVO introduces a novel architecture that leverages pretrained text-to-image models for generating temporally consistent videos, enhancing realism and diversity while requiring limited video data.
Contribution
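Building on AnimateDiff's approach of freezing the T2I model and training only temporal layers, HARIVO adds a mapping network and frame-wise tokens, together with new loss functions for temporal smoothness and a mitigating gradient sampling technique, to improve video quality.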
Findings
Achieved realistic, temporally smooth video generation
Successfully integrated video-specific inductive biases
Built on frozen StableDiffusion for simplified training
Abstract
We present a method to create diffusion-based video models from pretrained Text-to-Image (T2I) models. Recently, AnimateDiff proposed freezing the T2I model while only training temporal layers. We advance this method by proposing a unique architecture, incorporating a mapping network and frame-wise tokens, tailored for video generation while maintaining the diversity and creativity of the original T2I model. Key innovations include novel loss functions for temporal smoothness and a mitigating gradient sampling technique, ensuring realistic and temporally consistent video generation despite limited public video data. We have successfully integrated video-specific inductive biases into the architecture and loss functions. Our method, built on the frozen StableDiffusion model, simplifies training processes and allows for seamless integration with off-the-shelf models like ControlNet and…
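The freeze-and-train pattern the abstract attributes to AnimateDiff (keep the pretrained T2I weights fixed, optimize only the temporal layers) can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not the HARIVO architecture: `TemporalAttention`, `VideoBlock`, the zero-initialized output projection, and the `Conv2d` stand-in for a pretrained spatial layer are all hypothetical, and the mapping network, frame-wise tokens, and the paper's loss functions are not modeled.

```python
import torch
import torch.nn as nn


class TemporalAttention(nn.Module):
    """Hypothetical trainable layer: self-attention along the frame axis."""

    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        # Assumption: zero-init the output projection so the layer starts
        # as an identity and does not disturb the frozen model's outputs.
        nn.init.zeros_(self.attn.out_proj.weight)
        nn.init.zeros_(self.attn.out_proj.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels, height, width)
        b, f, c, h, w = x.shape
        # Treat every spatial location as its own sequence over frames.
        tokens = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, f, c)
        normed = self.norm(tokens)
        out, _ = self.attn(normed, normed, normed)
        tokens = tokens + out  # residual connection
        return tokens.reshape(b, h, w, f, c).permute(0, 3, 4, 1, 2)


class VideoBlock(nn.Module):
    """Frozen per-frame spatial layer followed by a trainable temporal layer."""

    def __init__(self, spatial: nn.Module, channels: int):
        super().__init__()
        self.spatial = spatial
        self.temporal = TemporalAttention(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, f, c, h, w = x.shape
        # Fold frames into the batch so the image layer sees single frames.
        y = self.spatial(x.reshape(b * f, c, h, w)).reshape(b, f, c, h, w)
        return self.temporal(y)


# Stand-in for a pretrained T2I layer (a real setup would load weights).
spatial = nn.Conv2d(8, 8, kernel_size=3, padding=1)
for p in spatial.parameters():
    p.requires_grad_(False)  # freeze the image model

block = VideoBlock(spatial, channels=8)
trainable = [p for p in block.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)

video = torch.randn(2, 4, 8, 16, 16)  # (batch, frames, channels, H, W)
out = block(video)
loss = out.pow(2).mean()  # placeholder objective, not the paper's loss
loss.backward()
optimizer.step()
```

Because only the temporal parameters are handed to the optimizer, the spatial weights stay identical to the pretrained ones. That property is what the abstract relies on when it mentions integration with off-the-shelf models built for the same frozen backbone.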
Taxonomy
Topics: Human Motion and Animation · Video Analysis and Summarization · Generative Adversarial Networks and Image Synthesis
