SSM Meets Video Diffusion Models: Efficient Long-Term Video Generation with Structured State Spaces

Yuta Oshima; Shohei Taniguchi; Masahiro Suzuki; Yutaka Matsuo

arXiv:2403.07711 · cs.CV · September 5, 2024 · 1 citation

TL;DR

This paper introduces a novel approach for long-term video generation by integrating structured state-space models (SSMs) into diffusion models, reducing computational costs and improving performance over traditional attention-based methods.

Contribution

The paper demonstrates that bidirectional SSMs can effectively replace attention layers in diffusion models for video generation, enabling longer sequences with less memory and comparable or better quality.

Findings

01. SSM-based models require less memory than attention-based models for sequences of up to 256 frames.

02. SSM-based models achieve comparable or better Fréchet Video Distance (FVD) scores than attention-based models.

03. Bidirectionality in SSMs improves the capture of temporal features in videos.
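To make the bidirectionality finding concrete, here is a minimal sketch of a bidirectional scan over a 1-D sequence of per-frame features. It is not the paper's implementation: the scalar parameters `a`, `b`, `c`, the function names, and the choice to sum the forward and backward outputs are all illustrative assumptions.

```python
from typing import List


def ssm_scan(x: List[float], a: float, b: float, c: float) -> List[float]:
    """Scalar linear recurrence: h_t = a * h_{t-1} + b * x_t, y_t = c * h_t."""
    h = 0.0
    ys = []
    for x_t in x:
        h = a * h + b * x_t
        ys.append(c * h)
    return ys


def bidirectional_ssm(x: List[float], a: float = 0.5,
                      b: float = 1.0, c: float = 1.0) -> List[float]:
    """Combine a forward scan with a scan over the time-reversed sequence.

    Summing the two directions is an assumption made for this sketch;
    other ways of merging them are possible.
    """
    forward = ssm_scan(x, a, b, c)
    backward = ssm_scan(x[::-1], a, b, c)[::-1]
    return [f + g for f, g in zip(forward, backward)]


# An impulse in the last frame reaches the first frame only via the backward scan.
print(ssm_scan([0.0, 0.0, 1.0], 0.5, 1.0, 1.0))
print(bidirectional_ssm([0.0, 0.0, 1.0]))
```

A forward-only scan leaves the first two outputs at zero for this input, whereas the bidirectional version lets every frame see both earlier and later frames, which is the property the finding refers to.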

Abstract

Given the remarkable achievements in image generation through diffusion models, the research community has shown increasing interest in extending these models to video generation. Recent diffusion models for video generation have predominantly utilized attention layers to extract temporal features. However, attention layers are limited by their computational costs, which increase quadratically with the sequence length. This limitation presents significant challenges when generating longer video sequences using diffusion models. To overcome this challenge, we propose leveraging state-space models (SSMs) as temporal feature extractors. SSMs (e.g., Mamba) have recently gained attention as promising alternatives due to their linear-time memory consumption relative to sequence length. In line with previous research suggesting that using bidirectional SSMs is effective for understanding…
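As a rough illustration of the scaling argument in the abstract, the sketch below counts the entries in a frames-by-frames temporal attention score matrix against the entries needed if one hidden state is stored per frame. The function names and `state_dim = 16` are hypothetical choices for this example, not values from the paper, and real memory use depends on many other factors.

```python
def attention_score_entries(num_frames: int) -> int:
    """Entries in one temporal attention score matrix (frames x frames)."""
    return num_frames * num_frames


def ssm_state_entries(num_frames: int, state_dim: int = 16) -> int:
    """Entries if one hidden state of size state_dim is kept per frame.

    state_dim = 16 is an illustrative assumption.
    """
    return num_frames * state_dim


# Quadratic growth versus linear growth as the frame count increases.
for frames in (16, 64, 256):
    print(frames, attention_score_entries(frames), ssm_state_entries(frames))
```

Quadrupling the frame count multiplies the attention count by sixteen but the per-frame state count by only four, which is why the gap widens for longer videos.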

Code & Models

Repositories

shim0114/ssm-meets-video-diffusion-models (PyTorch, official)

Taxonomy

Topics: Advanced Data Compression Techniques · Video Coding and Compression Technologies · Advanced Vision and Imaging

Methods: Diffusion