MSC: Multi-Scale Spatio-Temporal Causal Attention for Autoregressive Video Diffusion

Xunnong Xu; Mengying Cao

arXiv:2412.09828 · cs.CV · December 16, 2024

TL;DR

This paper introduces MSC, a multi-scale causal attention framework for autoregressive video diffusion that reduces computational complexity and enables efficient high-resolution video generation with rich semantics.

Contribution

It proposes a novel multi-scale causal attention mechanism that improves efficiency and supports autoregressive long video generation without violating temporal order.

Findings

01. Reduces computational complexity in video diffusion models

02. Enables high-resolution video generation with rich semantics

03. Supports autoregressive long video synthesis

Abstract

Diffusion transformers enable flexible generative modeling for video. However, generating high-resolution videos with rich semantics and complex motion remains technically challenging and computationally expensive. Like language, video data is auto-regressive by nature, so it is counter-intuitive to use an attention mechanism with bi-directional dependencies in the model. Here we propose a Multi-Scale Causal (MSC) framework to address these problems. Specifically, we introduce multiple resolutions in the spatial dimension and high and low frequencies in the temporal dimension to make attention computation efficient. Furthermore, attention blocks at multiple scales are combined in a controlled way to allow causal conditioning on noisy image frames during diffusion training, based on the idea that noise destroys information at different rates at different resolutions. We…
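To make the two ideas in the abstract concrete, here is a minimal NumPy sketch of frame-level causal attention and of how spatial resolution affects attention cost. It is an illustration under stated assumptions, not the authors' implementation: the function names (frame_causal_mask, causal_attention, attention_pairs) are hypothetical, a single attention head without learned projections is assumed, and the paper's controlled combination of scales and its temporal frequency split are not reproduced.

```python
import numpy as np


def frame_causal_mask(num_frames: int, tokens_per_frame: int) -> np.ndarray:
    """Boolean mask: True where a query token may attend to a key token.

    Assumption: tokens are ordered frame by frame, and a token may see
    every token in its own frame and in all earlier frames.
    """
    frame_ids = np.repeat(np.arange(num_frames), tokens_per_frame)
    return frame_ids[:, None] >= frame_ids[None, :]


def causal_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray,
                     mask: np.ndarray) -> np.ndarray:
    """Single-head scaled dot-product attention with a boolean mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)      # block future frames
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)                      # exp(-inf) == 0
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


def attention_pairs(num_frames: int, height: int, width: int) -> int:
    """Number of query-key pairs for full attention over all video tokens."""
    n = num_frames * height * width
    return n * n


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames, tokens, dim = 3, 2, 4
    mask = frame_causal_mask(frames, tokens)
    q, k, v = (rng.normal(size=(frames * tokens, dim)) for _ in range(3))
    out = causal_attention(q, k, v, mask)
    print(out.shape)
    # Halving height and width cuts the token count 4x and the pair count 16x.
    print(attention_pairs(4, 16, 16) // attention_pairs(4, 8, 8))
```

Because the mask only admits keys from the same or earlier frames, the output for a given frame does not change when later frames change, which is the property autoregressive generation relies on. The pair count shows why coarser spatial scales are cheap: attention cost grows quadratically in the number of tokens.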

Taxonomy

Methods: Softmax · Attention Is All You Need · Diffusion