Compositional Video Generation as Flow Equalization

Xingyi Yang; Xinchao Wang

arXiv:2407.06182·cs.CV·July 9, 2024

TL;DR

Vico is a framework that improves text-to-video generation by analyzing and balancing the influence of different concepts in the model, leading to more accurate and compositional videos.

Contribution

Vico introduces a novel attention flow-based method to explicitly balance concept influences in diffusion models for improved compositional video generation.

Findings

01

Enhanced compositional accuracy in generated videos

02

Better adherence to complex textual descriptions

03

Applicable to multiple diffusion-based video models

Abstract

Large-scale Text-to-Video (T2V) diffusion models have recently demonstrated unprecedented capability to transform natural language descriptions into stunning and photorealistic videos. Despite the promising results, a significant challenge remains: these models struggle to fully grasp complex compositional interactions between multiple concepts and actions. This issue arises when some words dominantly influence the final video, overshadowing other concepts. To tackle this problem, we introduce Vico, a generic framework for compositional video generation that explicitly ensures all concepts are represented properly. At its core, Vico analyzes how input tokens influence the generated video, and adjusts the model to prevent any single concept from dominating. Specifically, Vico extracts attention weights from all layers to build a spatial-temporal attention graph, and then…
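The abstract describes building a graph from attention weights, and the reviews below mention that token influence is measured with max flow. The following is a minimal sketch of that idea on a toy graph, not the authors' implementation: the node names, the capacities, the choice to treat each token as its own source, and the `balance_ratio` helper are all assumptions made for illustration. It uses a plain Edmonds-Karp max flow rather than any approximation the paper may use.

```python
from collections import deque


def max_flow(capacity, source, sink):
    """Edmonds-Karp max flow on a dict-of-dicts capacity graph."""
    # Build residual capacities, adding reverse edges with capacity 0.
    residual = {source: {}, sink: {}}
    for u, nbrs in capacity.items():
        for v, c in nbrs.items():
            residual.setdefault(u, {})
            residual.setdefault(v, {})
            residual[u][v] = residual[u].get(v, 0.0) + c
            residual[v].setdefault(u, 0.0)

    total = 0.0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 1e-12 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total

        # Find the bottleneck capacity along the path.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]

        # Push flow along the path and update the residual graph.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck


def balance_ratio(flows):
    """Hypothetical imbalance measure: min flow / max flow (1.0 = equal)."""
    return min(flows.values()) / max(flows.values())


# Toy attention graph (all values are made up for illustration).
# Text tokens feed two layers of latent nodes, which feed the video output.
attention_graph = {
    "dog": {"l1_a": 0.9, "l1_b": 0.6},
    "frisbee": {"l1_a": 0.1, "l1_b": 0.2},
    "l1_a": {"l2_a": 0.7, "l2_b": 0.3},
    "l1_b": {"l2_a": 0.4, "l2_b": 0.6},
    "l2_a": {"video": 1.0},
    "l2_b": {"video": 1.0},
}

# Assumption: each token's influence is its own max flow to the output.
flows = {tok: max_flow(attention_graph, tok, "video") for tok in ("dog", "frisbee")}
ratio = balance_ratio(flows)
print(flows, ratio)
```

In this toy graph the flow from "dog" is several times larger than the flow from "frisbee", which is the kind of dominance the abstract describes. How Vico turns such flow values into an adjustment of the model is not covered in the truncated abstract and is not shown here.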

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01·Rating 8·Confidence 3

Strengths

1. The proposed method is innovative and can be integrated with existing video generation techniques. The experiments demonstrate the effectiveness of applying the "Vico" method on current models such as AnimateDiff, ZeroScope V2, and VideoCrafter2, showing notable improvements in results.
2. The paper is well-articulated, featuring thorough theoretical analysis and proof.

Weaknesses

1. The user study (Table 2) is limited to only ten video clips, which is insufficient to conclusively prove the effectiveness of the method.

Reviewer 02·Rating 8·Confidence 4

Strengths

Capturing compositional relationships in the final generated output is a very important problem and, to the best of my knowledge, is one of the biggest issues with the current state of the art in generative AI. This paper correctly identifies one of the main issues in the attention mechanism and tries to improve the contribution of different tokens in attention layers as a test-time optimization. The authors model the information flow of the generative model as a graph, which is a smart and (semi-)novel strategy

Weaknesses

The biggest weakness of the solution is the readability of this paper. It was very hard for me to read through the text and jump from text to mathematical notations and back. I will ask for a few clarifications in the questions block.

Reviewer 03·Rating 6·Confidence 5

Strengths

1. This paper presents a highly innovative solution to the problem, utilizing traditional max flow to address token-level response balancing, thereby achieving effective compositional generation.
2. I appreciate that extensive effort has been put into designing feasible experiments. The authors introduce practical techniques such as subgraph, soft min, and vectorized flow strategies, which significantly enhance inference speed.
3. The experiments are thorough and well-executed, including compreh

Weaknesses

1. The primary concern is the lack of comparisons or discussions involving recent text-to-video (T2V) methods. The baseline model, VideoCrafter2, was released over a year ago. To convincingly demonstrate the relevance of the compositional generation problem, the paper should ideally compare against more advanced, recent baselines like OpenSora [1], CogVideoX [2], or more.
2. Additionally, the paper lacks comparisons to existing compositional video generation models. For instance, methods like LVD

Code & Models

Repositories

Adamdad/vico
PyTorch·Official

Taxonomy

Topics·Cinema and Media Studies

Methods·Diffusion