Compositional Video Generation as Flow Equalization
Xingyi Yang, Xinchao Wang

TL;DR
Vico is a framework that improves text-to-video generation by analyzing and balancing the influence of different concepts in the model, leading to more accurate and compositional videos.
Contribution
Vico introduces a novel attention-flow-based method that explicitly balances the influence of each concept in diffusion models, improving compositional video generation.
Findings
Enhanced compositional accuracy in generated videos
Better adherence to complex textual descriptions
Applicable to multiple diffusion-based video models
Abstract
Large-scale Text-to-Video (T2V) diffusion models have recently demonstrated unprecedented capability to transform natural language descriptions into stunning and photorealistic videos. Despite the promising results, a significant challenge remains: these models struggle to fully grasp complex compositional interactions between multiple concepts and actions. This issue arises when some words dominantly influence the final video, overshadowing other concepts. To tackle this problem, we introduce Vico, a generic framework for compositional video generation that explicitly ensures all concepts are represented properly. At its core, Vico analyzes how input tokens influence the generated video, and adjusts the model to prevent any single concept from dominating. Specifically, Vico extracts attention weights from all layers to build a spatial-temporal attention graph, and then…
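The abstract is cut off before it says how the attention graph is used, but the reviews below mention a max-flow formulation. The following is a minimal sketch of that general idea, not the authors' implementation: attention weights are treated as edge capacities, and the max flow from each text token to the output measures that token's influence. The toy graph, the node names ("cat", "dog", "h1", "h2", "video"), and the capacities are all hypothetical, and the function `max_flow` is a plain Edmonds-Karp routine written for illustration.

```python
from collections import defaultdict, deque


def max_flow(capacity, source, sink, eps=1e-12):
    """Edmonds-Karp max flow on a dict-of-dicts capacity graph."""
    # Build the residual graph: forward capacities plus zero-capacity reverse edges.
    residual = defaultdict(dict)
    for u, nbrs in capacity.items():
        for v, c in nbrs.items():
            residual[u][v] = residual[u].get(v, 0.0) + c
            residual[v].setdefault(u, 0.0)

    total = 0.0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > eps and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total  # no augmenting path left

        # Find the bottleneck capacity along the path.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]

        # Push the bottleneck amount along the path.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck


# Hypothetical toy attention graph: two text tokens, two intermediate
# nodes, one output node. Capacities stand in for attention weights.
capacity = {
    "cat": {"h1": 0.9, "h2": 0.6},
    "dog": {"h1": 0.1, "h2": 0.4},
    "h1": {"video": 0.7},
    "h2": {"video": 0.3},
}

# Per-token influence, measured as max flow from the token to the output.
flows = {tok: max_flow(capacity, tok, "video") for tok in ("cat", "dog")}
for tok, f in flows.items():
    print(f"{tok}: {f:.2f}")
```

In this toy graph the "cat" token pushes more flow to the output than "dog", which is the kind of imbalance the abstract describes. A balancing step would then adjust the model so that the smallest per-token flow increases; how Vico does this is not stated in the text above.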
Peer Reviews
Decision: Submitted to ICLR 2025
1. The proposed method is innovative and can be integrated with existing video generation techniques. The experiments demonstrate the effectiveness of applying the "Vico" method to current models such as AnimateDiff, ZeroScope V2, and VideoCrafter2, showing notable improvements in results. 2. The paper is well-articulated, featuring thorough theoretical analysis and proof.
1. The user study (Table 2) is limited to only ten video clips, which is insufficient to conclusively prove the effectiveness of the method.
Capturing compositional relationships in the final generated output is a very important problem and, to the best of my knowledge, one of the biggest issues with the current state of the art in generative AI. This paper correctly identifies one of the main issues in the attention mechanism and tries to balance the contribution of different tokens in attention layers as a test-time optimization. The authors model the information flow of the generative model as a graph, which is a smart and (semi-)novel strategy
The biggest weakness of the solution is the readability of this paper. It was very hard for me to read through the text and jump from text to mathematical notations and back. I will ask for a few clarifications in the questions block.
1. This paper presents a highly innovative solution to the problem, utilizing traditional max flow to address token-level response balancing, thereby achieving effective compositional generation. 2. I appreciate that extensive effort has been put into designing feasible experiments. The authors introduce practical techniques such as subgraph, soft min, and vectorized flow strategies, which significantly enhance inference speed. 3. The experiments are thorough and well-executed, including compreh
1. The primary concern is the lack of comparisons or discussions involving recent text-to-video (T2V) methods. The baseline model, VideoCrafter2, was released over a year ago. To convincingly demonstrate the relevance of the compositional generation problem, the paper should ideally compare against more advanced, recent baselines such as OpenSora [1] or CogVideoX [2]. 2. Additionally, the paper lacks comparisons to existing compositional video generation models. For instance, methods like LVD
Taxonomy
Topics: Cinema and Media Studies
Methods: Diffusion
