Accelerating Text-to-Video Generation with Calibrated Sparse Attention

Shai Yehezkel; Shahar Yadin; Noam Elata; Yaron Ostrovsky-Berman; Bahjat Kawar

arXiv:2603.05503 · cs.CV · March 6, 2026


Open Access

TL;DR

This paper introduces CalibAtt, a training-free sparse attention method that accelerates text-to-video diffusion models by identifying and skipping negligible token-to-token connections, achieving up to 1.58x speedup while maintaining video quality and text-video alignment.

Contribution

CalibAtt leverages offline calibration to identify stable sparsity patterns, enabling efficient inference in transformer-based video generation models.

Findings

01 Achieves up to 1.58x speedup in video generation.

02 Maintains quality and text-video alignment.

03 Outperforms existing training-free acceleration methods.

Abstract

Recent diffusion models enable high-quality video generation, but suffer from slow runtimes. The large transformer-based backbones used in these models are bottlenecked by spatiotemporal attention. In this paper, we identify that a significant fraction of token-to-token connections consistently yield negligible scores across various inputs, and their patterns often repeat across queries. Thus, the attention computation in these cases can be skipped with little to no effect on the result. This observation continues to hold for connections among local token blocks. Motivated by this, we introduce CalibAtt, a training-free method that accelerates video generation via calibrated sparse attention. CalibAtt performs an offline calibration pass that identifies block-level sparsity and repetition patterns that are stable across inputs, and compiles these patterns into optimized attention…
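The block-level idea described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the threshold `eps`, the "keep a block if it matters on any calibration sample" rule, the always-kept diagonal, and the synthetic data are all assumptions made for illustration, and the repetition patterns across queries that the abstract mentions are not modeled. The sketch only shows how a mask calibrated offline on a few inputs might let attention skip whole key blocks on a new input.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def block_mass(q, k, block):
    """Attention mass each query block sends to each key block (mean over its queries)."""
    n, d = q.shape
    nb = n // block
    p = softmax(q @ k.T / np.sqrt(d))                      # dense (n, n) attention
    return p.reshape(nb, block, nb, block).sum(axis=3).mean(axis=1)

def calibrate_mask(samples, block, eps=1e-3):
    """Offline pass (hypothetical rule): keep a block if its mass exceeds eps on ANY sample."""
    worst = np.max([block_mass(q, k, block) for q, k, _ in samples], axis=0)
    keep = worst > eps
    keep |= np.eye(keep.shape[0], dtype=bool)              # assumption: never drop the diagonal
    return keep

def block_sparse_attention(q, k, v, keep, block):
    """Attention that only touches the key blocks marked True in `keep`."""
    n, d = q.shape
    out = np.zeros_like(v)
    for qb in range(n // block):
        cols = np.flatnonzero(keep[qb])                    # key blocks kept for this query block
        idx = (cols[:, None] * block + np.arange(block)).ravel()
        rows = slice(qb * block, (qb + 1) * block)
        p = softmax(q[rows] @ k[idx].T / np.sqrt(d))       # softmax over kept keys only
        out[rows] = p @ v[idx]
    return out

def make_sample(rng, n=16, block=4, d=8):
    """Synthetic input whose attention is concentrated within each block (illustrative only)."""
    base = np.zeros((n, d))
    base[np.arange(n), np.arange(n) // block] = 6.0
    q = base + 0.05 * rng.standard_normal((n, d))
    k = base + 0.05 * rng.standard_normal((n, d))
    v = rng.standard_normal((n, d))
    return q, k, v

rng = np.random.default_rng(0)
block = 4
keep = calibrate_mask([make_sample(rng) for _ in range(8)], block)

# Apply the calibrated mask to a new input and compare against dense attention.
q, k, v = make_sample(rng)
dense = softmax(q @ k.T / np.sqrt(q.shape[1])) @ v
sparse = block_sparse_attention(q, k, v, keep, block)
max_err = np.abs(dense - sparse).max()
skipped_fraction = 1.0 - keep.mean()
```

In this toy setup `skipped_fraction` measures how many block pairs are never computed and `max_err` measures the deviation from dense attention. A real speedup would additionally depend on fused GPU kernels, which a Python loop like this one does not provide.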


Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Face recognition and analysis