Video-Infinity: Distributed Long Video Generation
Zhenxiong Tan, Xingyi Yang, Songhua Liu, Xinchao Wang

TL;DR
Video-Infinity introduces a distributed inference pipeline with novel mechanisms to generate long videos efficiently across multiple GPUs, significantly reducing generation time compared to existing methods.
Contribution
The paper presents a new distributed inference framework with Clip parallelism and Dual-scope attention for long video generation without retraining models.
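As a rough illustration of the Dual-scope attention idea named above, which the reviews below describe as combining local and global temporal context, the following sketch picks the frame indices a single clip might attend to. The function name `dual_scope_indices` and the parameters `local_pad` and `num_global` are hypothetical, and the uniform spacing of the global frames is an assumption for illustration, not a detail confirmed on this page.

```python
def dual_scope_indices(clip: range, num_frames: int,
                       local_pad: int, num_global: int) -> list[int]:
    """Return sorted frame indices one clip could use as attention context.

    Assumption: "local" means the clip plus `local_pad` neighbouring frames
    on each side; "global" means `num_global` frames spread evenly over the
    whole video. Both definitions are illustrative only.
    """
    # Local scope: the clip itself, widened by local_pad and clamped to the video.
    lo = max(0, clip.start - local_pad)
    hi = min(num_frames, clip.stop + local_pad)
    local_scope = set(range(lo, hi))

    # Global scope: evenly spaced frames across the full video.
    if num_global <= 0:
        global_scope = set()
    elif num_global == 1:
        global_scope = {num_frames // 2}
    else:
        global_scope = {
            (i * (num_frames - 1)) // (num_global - 1)
            for i in range(num_global)
        }

    return sorted(local_scope | global_scope)


# Example: an 8-frame clip in the middle of a 32-frame video.
keys = dual_scope_indices(range(8, 16), num_frames=32, local_pad=2, num_global=4)
```

Under these assumptions the clip sees its immediate neighbours plus a few frames spread over the whole video, so the context size stays roughly constant as the video gets longer.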
Findings
Generated 2,300-frame videos in 5 minutes on 8 GPUs.
Achieved 100x faster long video generation than prior methods.
Enabled scalable long video synthesis without additional training.
Abstract
Diffusion models have recently achieved remarkable results for video generation. Despite the encouraging performances, the generated videos are typically constrained to a small number of frames, resulting in clips lasting merely a few seconds. The primary challenges in producing longer videos include the substantial memory requirements and the extended processing time required on a single GPU. A straightforward solution would be to split the workload across multiple GPUs, which, however, leads to two issues: (1) ensuring all GPUs communicate effectively to share timing and context information, and (2) modifying existing video diffusion models, which are usually trained on short sequences, to create longer videos without additional training. To tackle these, in this paper we introduce Video-Infinity, a distributed inference pipeline that enables parallel processing across multiple GPUs…
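To make the idea of splitting the workload across GPUs concrete, here is a minimal sketch of dividing a long frame sequence into contiguous clips, one per device. The helper name `split_into_clips` is hypothetical and the near-equal split is an assumption; the paper's actual scheduling and inter-GPU communication are not described on this page.

```python
def split_into_clips(num_frames: int, num_devices: int) -> list[range]:
    """Divide frame indices 0..num_frames-1 into contiguous, near-equal clips.

    Assumption: one clip per device, with any remainder frames given to the
    lowest-ranked devices. This only illustrates the partitioning step.
    """
    if num_devices <= 0:
        raise ValueError("num_devices must be positive")

    base, extra = divmod(num_frames, num_devices)
    clips = []
    start = 0
    for rank in range(num_devices):
        # The first `extra` devices take one additional frame each.
        length = base + (1 if rank < extra else 0)
        clips.append(range(start, start + length))
        start += length
    return clips


# Example using the frame and GPU counts quoted in the Findings section.
clips = split_into_clips(2300, 8)
```

With 2,300 frames and 8 devices, this split gives the first four devices 288 frames each and the remaining four 287 each.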
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
1. This work brings incremental novelty by adapting distributed parallelism specifically for long-form video generation. It introduces a dual-scope attention mechanism to balance local and global temporal interactions, ensuring coherence across extended sequences. The clip parallelism approach further enables efficient processing of video clips across GPUs, effectively handling the unique scalability and memory demands of video data. These adaptations, including optimizations for temporal contin
1. Performance. In Table 2, under the 64-frame setting, although the proposed work achieved the highest overall score, it did not show clearly better results than the other baselines. 2. Results on longer context. This work claims the capability to generate longer video clips, yet Table 2 only shows results for a maximum of 192 frames. Since the paper emphasizes long video generation, I would suggest adding more quantitative results on longer videos. 3. Results on memory usage comparison. T
1. The paper is well-written and easy to follow. 2. It is a training-free inference pipeline that extends the baseline model's generation capacity. 3. ***Dual-scope Attention*** provides a new view of gathering the global and local context for high-fidelity long video generation. The generation results are impressive. It might provide insight into the training scheme or new architecture design.
1. The novelty of **Clip Parallelism** is limited. The paper merely migrates DistriFusion [1] to video diffusion models: DistriFusion splits a large image into patches, while this paper splits a long video into short clips. The distributed modules are similar to the sparse operations in DistriFusion [1], except for extending the sparse 2D convolution to 1D/3D temporal convolution with different padding schemes. The *GroupNorm* modification is also similar. Moreover, the DistriFusi
1. The empirical results are persuasive, with Video-Infinity achieving a 10x improvement over comparable methods like FIFO-Diffusion and being significantly faster than alternatives like Streaming T2V. 2. The paper is well-organized, clearly outlining the technical details, methodology, and communication strategies.
1. It looks like this work adopts the idea from DistriFusion [1]. While the authors claim to tackle a more challenging problem, the frame dimension is, from a technical standpoint, actually much simpler to manage than the problems addressed in DistriFusion. 2. How does this method impact frame-to-frame continuity? I noticed that many of the generated videos in the Supplementary Material exhibit noticeable continuity issues. The authors do not seem to have adequately addressed th
1. The integration of Clip parallelism and Dual-scope attention is a novel approach that effectively addresses the scalability and efficiency challenges in video generation. 2. The paper demonstrates the ability to generate longer videos much faster than current methods, achieving substantial reductions in generation time. 3. Experiments are conducted to validate the performance, showcasing significant improvements over other methods in terms of speed and video length capabilities.
1. The method of synchronizing context across GPUs, crucial for maintaining temporal coherence, is not discussed in detail. 2. While the framework improves efficiency, there is not much discussion of how these gains affect the qualitative aspects of the videos, such as resolution and realism, particularly under complex scene dynamics.
Taxonomy
Topics: Cinema and Media Studies · Computability, Logic, AI Algorithms
Methods: Softmax · Attention Is All You Need · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Contrastive Language-Image Pre-training · Diffusion · Adaptive Discriminator Augmentation
