ReHyAt: Recurrent Hybrid Attention for Video Diffusion Transformers

Mohsen Ghafoorian; Amirhossein Habibian

arXiv:2601.04342·cs.CV·January 9, 2026

Open Access

TL;DR

ReHyAt is a recurrent hybrid attention mechanism for video diffusion transformers. It reduces attention complexity from quadratic to linear, enabling scalable long-duration video generation at competitive quality with a training cost of roughly 160 GPU hours.

Contribution

The paper proposes ReHyAt, a hybrid attention method that combines softmax and linear attention. The hybrid design allows efficient distillation from existing softmax-based models, giving scalable video diffusion modeling at reduced training cost.

Findings

01

ReHyAt remains competitive in video quality, with experiments reported on VBench.

02

It reduces attention complexity from quadratic to linear in sequence length, with constant memory usage through a chunk-wise recurrent reformulation.

03

Distillation from existing softmax-based models reduces training cost by two orders of magnitude, to about 160 GPU hours.
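
To make the quadratic-to-linear claim concrete, the block below contrasts standard softmax attention with generic kernelized linear attention. It is a textbook-style sketch, not the paper's exact formulation: the feature map \phi, the sequence length N, and the dimensions d and d_v are assumed notation that does not appear in the summary above.

```latex
% Softmax attention: each of the N queries scores all N keys,
% so the cost grows as O(N^2 d).
\[
\mathrm{out}_i^\top
  = \frac{\sum_{j=1}^{N} \exp\!\left(q_i^\top k_j / \sqrt{d}\right) v_j^\top}
         {\sum_{j=1}^{N} \exp\!\left(q_i^\top k_j / \sqrt{d}\right)}
\]

% Linear attention (assumed generic form): replace the exponential with
% \phi(q_i)^\top \phi(k_j) and pull \phi(q_i) out of the sums.
\[
\mathrm{out}_i^\top
  = \frac{\phi(q_i)^\top \left(\sum_{j=1}^{N} \phi(k_j)\, v_j^\top\right)}
         {\phi(q_i)^\top \left(\sum_{j=1}^{N} \phi(k_j)\right)}
\]

% The two sums do not depend on i, so they are computed once in
% O(N d d_v) and reused for every query: linear in N.
```

Because the two sums can be accumulated incrementally, the same expression can also be evaluated as a recurrence with a fixed-size state, which is what makes a constant-memory formulation possible.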

Abstract

Recent advances in video diffusion models have shifted towards transformer-based architectures, achieving state-of-the-art video generation but at the cost of quadratic attention complexity, which severely limits scalability for longer sequences. We introduce ReHyAt, a Recurrent Hybrid Attention mechanism that combines the fidelity of softmax attention with the efficiency of linear attention, enabling chunk-wise recurrent reformulation and constant memory usage. Unlike the concurrent linear-only SANA Video, ReHyAt's hybrid design allows efficient distillation from existing softmax-based models, reducing the training cost by two orders of magnitude to ~160 GPU hours, while remaining competitive in quality. Our lightweight distillation and finetuning pipeline provides a recipe that can be applied to future state-of-the-art bidirectional softmax-based models. Experiments on VBench and…
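
The abstract's combination of softmax attention, linear attention, and a chunk-wise recurrence can be illustrated with a small NumPy sketch. This is one plausible reading, not the authors' implementation: it assumes softmax-style attention inside each chunk and linear attention over all earlier chunks, summarized in a fixed-size state. The names `phi`, `hybrid_chunked_attention`, `hybrid_reference`, and `chunk`, the elu(x) + 1 feature map, and the restriction to earlier chunks are all assumptions made for illustration.

```python
import numpy as np

def phi(x):
    # Assumed positive feature map elu(x) + 1, a common linear-attention choice.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def hybrid_chunked_attention(q, k, v, chunk):
    """Hypothetical hybrid attention: softmax weights within a chunk, linear
    attention over earlier chunks via a fixed-size recurrent state."""
    n, d = q.shape
    dv = v.shape[1]
    state = np.zeros((d, dv))  # running sum of phi(k)^T v over past chunks
    norm = np.zeros(d)         # running sum of phi(k) over past chunks
    out = np.empty((n, dv))
    for s in range(0, n, chunk):
        qc, kc, vc = q[s:s + chunk], k[s:s + chunk], v[s:s + chunk]
        # Local part: unnormalized softmax weights inside the chunk
        # (no max-subtraction, for brevity; fine for small inputs).
        w = np.exp(qc @ kc.T / np.sqrt(d))
        # Global part: read the summary of all earlier chunks from the state.
        fq = phi(qc)
        num = w @ vc + fq @ state
        den = w.sum(axis=1) + fq @ norm
        out[s:s + chunk] = num / den[:, None]
        # Fold the current chunk into the state; its size never grows with n.
        fk = phi(kc)
        state += fk.T @ vc
        norm += fk.sum(axis=0)
    return out

def hybrid_reference(q, k, v, chunk):
    """Same weights built as a full n x n matrix, for checking only."""
    n, d = q.shape
    idx = np.arange(n) // chunk
    soft = np.exp(q @ k.T / np.sqrt(d))
    lin = phi(q) @ phi(k).T
    same = idx[:, None] == idx[None, :]   # key in the same chunk as the query
    past = idx[None, :] < idx[:, None]    # key in an earlier chunk
    w = np.where(same, soft, 0.0) + np.where(past, lin, 0.0)
    return (w @ v) / w.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((10, 4)) for _ in range(3))
recurrent = hybrid_chunked_attention(q, k, v, chunk=4)
reference = hybrid_reference(q, k, v, chunk=4)
print(np.allclose(recurrent, reference))
```

In this sketch the per-step working memory depends only on the chunk size and the feature dimensions, not on the total number of tokens, which matches the constant-memory property described in the abstract. The quadratic cost is confined to each chunk.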

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Image and Video Quality Assessment · Image Enhancement Techniques