BlobGEN-Vid: Compositional Text-to-Video Generation with Blob Video Representations
Weixi Feng, Chao Liu, Sifei Liu, William Yang Wang, Arash Vahdat, Weili Nie

TL;DR
BlobGEN-Vid is a controllable, compositional text-to-video generation framework built on blob video representations. It enables fine-grained object control and smooth transitions, and it outperforms existing models in zero-shot video generation and layout controllability.
Contribution
We propose a novel blob video representation and a blob-grounded diffusion model that improve controllability and compositionality in text-to-video generation, with a masked 3D attention module for regional consistency across frames and a learnable module for interpolating text embeddings.
Findings
Achieves state-of-the-art zero-shot video generation.
Outperforms existing models in layout controllability.
Enables fine-grained control over object motion and appearance.
Abstract
Existing video generation models struggle to follow complex text prompts and synthesize multiple objects, raising the need for additional grounding input for improved controllability. In this work, we propose to decompose videos into visual primitives - blob video representation, a general representation for controllable video generation. Based on blob conditions, we develop a blob-grounded video diffusion model named BlobGEN-Vid that allows users to control object motions and fine-grained object appearance. In particular, we introduce a masked 3D attention module that effectively improves regional consistency across frames. In addition, we introduce a learnable module to interpolate text embeddings so that users can control semantics in specific frames and obtain smooth object transitions. We show that our framework is model-agnostic and build BlobGEN-Vid based on both U-Net and…
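The abstract does not specify how the masked 3D attention module builds its mask, so the following is only a minimal sketch of the general idea, assuming a hypothetical rule in which each spatio-temporal token carries a blob (region) id and may attend only to tokens with the same id in any frame. The names `masked_attention` and `region_ids` are illustrative and not taken from the paper.

```python
import math

def masked_attention(queries, keys, values, region_ids):
    """Scaled dot-product attention over flattened (frame, position) tokens.

    Assumed masking rule (hypothetical, not from the paper): token i may
    attend to token j only if both carry the same region id.
    """
    dim = len(queries[0])
    outputs = []
    for i, q in enumerate(queries):
        # Tokens in the same region, gathered across all frames.
        allowed = [j for j, r in enumerate(region_ids) if r == region_ids[i]]
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(dim)
            for j in allowed
        ]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * values[j][d] for w, j in zip(weights, allowed)) / total
            for d in range(len(values[0]))
        ])
    return outputs

# Toy video: 2 frames x 2 tokens per frame, flattened to 4 tokens.
region_ids = [0, 1, 0, 1]  # which (hypothetical) blob each token belongs to
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 5.0], [3.0, 0.0], [0.0, 7.0]]
out = masked_attention(tokens, tokens, values, region_ids)
```

In this toy case each token mixes values only from its own region across both frames, which is the kind of cross-frame regional coupling the abstract attributes to the module; a real implementation would operate on batched tensors and might use a different mask.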
Taxonomy
Topics: Video Analysis and Summarization · Natural Language Processing Techniques · Human Motion and Animation
Methods: Softmax · Attention Is All You Need · Max Pooling · Convolution · Concatenated Skip Connection · U-Net · Diffusion
