Video-Infinity: Distributed Long Video Generation

Zhenxiong Tan; Xingyi Yang; Songhua Liu; Xinchao Wang

arXiv:2406.16260·cs.CV·June 25, 2024



TL;DR

Video-Infinity introduces a distributed inference pipeline with novel mechanisms to generate long videos efficiently across multiple GPUs, significantly reducing generation time compared to existing methods.

Contribution

The paper presents a new distributed inference framework with Clip parallelism and Dual-scope attention for long video generation without retraining models.
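As a rough illustration of how an attention scope could mix local and global temporal context, the sketch below picks which frame indices a clip would read keys and values from. This page does not specify how Dual-scope attention is computed, so the split into "neighboring frames" plus "uniformly strided frames across the whole video" is an assumption, and `dual_scope_indices`, `local_pad`, and `num_global` are hypothetical names chosen for this example.

```python
def dual_scope_indices(clip: range, num_frames: int,
                       local_pad: int, num_global: int) -> list[int]:
    """Return sorted frame indices a clip would attend to (illustrative only)."""
    # Local scope (assumption): the clip plus `local_pad` frames on each side,
    # clamped to the valid frame range.
    lo = max(0, clip.start - local_pad)
    hi = min(num_frames, clip.stop + local_pad)
    local = set(range(lo, hi))

    # Global scope (assumption): `num_global` frames at a uniform stride
    # over the full video, shared by every clip.
    global_scope = {(i * num_frames) // num_global for i in range(num_global)}

    return sorted(local | global_scope)


# A 64-frame video; this clip covers frames 16..31.
idx = dual_scope_indices(range(16, 32), num_frames=64, local_pad=4, num_global=4)
print(idx[0], idx[-1], len(idx))
```

In this toy setting the clip sees its own frames, four neighbors on each side, and a few distant frames, so the context it reads stays much smaller than the full video while still reaching both ends of it.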

Findings

1. Generated 2,300-frame videos in 5 minutes on 8 GPUs.

2. Achieved 100x faster long video generation than prior methods.

3. Enabled scalable long video synthesis without additional training.
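To put the headline numbers in concrete terms, the following sketch divides 2,300 frames into contiguous clips for 8 devices and computes the implied throughput. The near-equal contiguous split is an assumption made for illustration; the page does not state how frames are assigned to GPUs, and `split_into_clips` is a hypothetical helper.

```python
def split_into_clips(num_frames: int, num_devices: int) -> list[range]:
    """Split frame indices into contiguous, near-equal clips, one per device."""
    base, extra = divmod(num_frames, num_devices)
    clips = []
    start = 0
    for rank in range(num_devices):
        # The first `extra` devices take one additional frame each.
        length = base + (1 if rank < extra else 0)
        clips.append(range(start, start + length))
        start += length
    return clips


clips = split_into_clips(2300, 8)
clip_lengths = [len(c) for c in clips]

# Throughput implied by "2,300 frames in 5 minutes".
frames_per_second = 2300 / (5 * 60)

print(clip_lengths)
print(round(frames_per_second, 2))
```

Under this split each device handles 287 or 288 frames, and the reported run corresponds to roughly 7.7 generated frames per second across the 8 GPUs.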

Abstract

Diffusion models have recently achieved remarkable results for video generation. Despite the encouraging performances, the generated videos are typically constrained to a small number of frames, resulting in clips lasting merely a few seconds. The primary challenges in producing longer videos include the substantial memory requirements and the extended processing time required on a single GPU. A straightforward solution would be to split the workload across multiple GPUs, which, however, leads to two issues: (1) ensuring all GPUs communicate effectively to share timing and context information, and (2) modifying existing video diffusion models, which are usually trained on short sequences, to create longer videos without additional training. To tackle these, in this paper we introduce Video-Infinity, a distributed inference pipeline that enables parallel processing across multiple GPUs…

Peer Reviews

Decision·ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 5 · Confidence 3

Strengths

1. This work brings incremental novelty by adapting distributed parallelism specifically for long-form video generation. It introduces a dual-scope attention mechanism to balance local and global temporal interactions, ensuring coherence across extended sequences. The clip parallelism approach further enables efficient processing of video clips across GPUs, effectively handling the unique scalability and memory demands of video data. These adaptations, including optimizations for temporal contin

Weaknesses

1. Performance. In Table 2, under the 64-frame setting, the proposed method achieves the highest overall score but does not show clearly dominant results over the other baselines. 2. Results on longer context. This work claims the capability to generate longer video clips, yet Table 2 reports results for a maximum of only 192 frames. Since the paper emphasizes long video generation, I would suggest including more quantitative results on longer videos. 3. Results on memory usage comparison. T

Reviewer 02 · Rating 3 · Confidence 5

Strengths

1. The paper is well-written and easy to follow. 2. It is a training-free inference pipeline while extending the baseline model generation capacity. 3. ***Dual-scope Attention*** provides a new view of gathering the global and local context for high-fidelity long video generation. The generation results are impressive. It might provide insight into the training scheme or new architecture design.

Weaknesses

1. The novelty of **Clip Parallelism** is limited. The paper merely migrates DistriFusion [1] to the video diffusion model: DistriFusion splits a large image into patches, while this paper splits a long video into short clips. The distributed modules are similar to the sparse operations in DistriFusion [1], except that the sparse 2D convolution is extended to 1D/3D temporal convolution with different padding schemes. The *GroupNorm* modification is also similar. Moreover, the DistriFusi

Reviewer 03 · Rating 3 · Confidence 4

Strengths

1. The empirical results are persuasive, with Video-Infinity achieving a 10x improvement over comparable methods like FIFO-Diffusion and being significantly faster than alternatives like StreamingT2V. 2. The paper is well-organized, clearly outlining the technical details, methodology, and communication strategies.

Weaknesses

1. It looks like this work adopts the idea from DistriFusion [1]. While the authors claim to tackle a more challenging problem, the frame dimension is, from a technical standpoint, actually much simpler to manage than the problems addressed in DistriFusion. 2. How does this method impact frame-to-frame continuity? I noticed that many of the generated videos in the Supplementary Material exhibit noticeable continuity issues. The authors do not seem to have adequately addressed th

Reviewer 04 · Rating 6 · Confidence 4

Strengths

1. The integration of Clip parallelism and Dual-scope attention is a novel approach that effectively addresses the scalability and efficiency challenges in video generation. 2. The paper demonstrates the ability to generate longer videos much faster than current methods, achieving substantial reductions in generation time. 3. Experiments are conducted to validate the performance, showcasing significant improvements over other methods in terms of speed and video length capabilities.

Weaknesses

1. The method of synchronizing context across GPUs, which is crucial for maintaining temporal coherence, is not discussed in detail. 2. While the framework improves efficiency, there is little discussion of how these gains affect qualitative aspects of the videos, such as resolution and realism, particularly under complex scene dynamics.


Taxonomy

Topics: Cinema and Media Studies · Computability, Logic, AI Algorithms

Methods: Softmax · Attention Is All You Need · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Contrastive Language-Image Pre-training · Diffusion · Adaptive Discriminator Augmentation