BlobGEN-Vid: Compositional Text-to-Video Generation with Blob Video Representations

Weixi Feng; Chao Liu; Sifei Liu; William Yang Wang; Arash Vahdat; Weili Nie

arXiv:2501.07647·cs.CV·January 15, 2025

Open Access

TL;DR

BlobGEN-Vid introduces a controllable, compositional text-to-video generation framework using blob video representations, enabling detailed object control and smooth transitions, outperforming existing models in zero-shot and layout controllability tasks.

Contribution

We propose a novel blob video representation and a diffusion model that enhances controllability and compositionality in text-to-video generation, with effective regional consistency and semantic interpolation.

Findings

01. Achieves state-of-the-art zero-shot video generation.

02. Outperforms existing models in layout controllability.

03. Enables fine-grained control over object motion and appearance.

Abstract

Existing video generation models struggle to follow complex text prompts and synthesize multiple objects, raising the need for additional grounding input for improved controllability. In this work, we propose to decompose videos into visual primitives - blob video representation, a general representation for controllable video generation. Based on blob conditions, we develop a blob-grounded video diffusion model named BlobGEN-Vid that allows users to control object motions and fine-grained object appearance. In particular, we introduce a masked 3D attention module that effectively improves regional consistency across frames. In addition, we introduce a learnable module to interpolate text embeddings so that users can control semantics in specific frames and obtain smooth object transitions. We show that our framework is model-agnostic and build BlobGEN-Vid based on both U-Net and…

Taxonomy

Topics: Video Analysis and Summarization · Natural Language Processing Techniques · Human Motion and Animation

Methods: Softmax · Attention Is All You Need · Max Pooling · Convolution · Concatenated Skip Connection · U-Net · Diffusion