TRecViT: A Recurrent Video Transformer

Viorica Pătrăucean; Xu Owen He; Joseph Heyward; Chuhan Zhang; Mehdi S. M. Sajjadi; George-Cristian Muraru; Artem Zholus; Mahdi Karami; Ross Goroshin; Yutian Chen; Simon Osindero; João Carreira; Razvan Pascanu

arXiv:2412.14294·cs.CV·February 17, 2026

TL;DR

TRecViT is a causal video architecture that combines gated linear recurrent units over time, self-attention over space, and MLPs over channels. It matches or outperforms the non-causal ViViT-L with 3× fewer parameters, a 12× smaller memory footprint, and 5× fewer FLOPs, and runs inference at about 300 frames per second.

Contribution

TRecViT is the first causal video model in the state-space model family. It mixes information over time with gated linear recurrent units and over space with self-attention, which keeps video modelling both efficient and effective.

Findings

01 Matches or outperforms the non-causal ViViT-L on large-scale video datasets (SSv2, Kinetics400) with 3× fewer parameters.

02 Achieves real-time inference at about 300 frames per second.

03 State-of-the-art results on SSv2 among causal models.

Abstract

We propose a novel block for causal video modelling. It relies on a time-space-channel factorisation with dedicated blocks for each dimension: gated linear recurrent units (LRUs) perform information mixing over time, self-attention layers perform mixing over space, and MLPs over channels. The resulting architecture, TRecViT, is causal and shows strong performance on sparse and dense tasks, trained in supervised or self-supervised regimes, being the first causal video model in the state-space models family. Notably, our model outperforms or is on par with the popular (non-causal) ViViT-L model on large-scale video datasets (SSv2, Kinetics400), while having 3× fewer parameters, a 12× smaller memory footprint, and a 5× lower FLOPs count than the full self-attention ViViT, with an inference throughput of about 300 frames per second, running comfortably in…
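The time-space-channel factorisation described in the abstract can be illustrated with a minimal NumPy sketch. This is not the official google-deepmind/trecvit code: every function and parameter name below is hypothetical, the gate is a simplified stand-in for the paper's gated LRU, attention is single-head, and normalisation layers are omitted. The input is assumed to be a tensor of shape (T, N, D): T frames, N spatial tokens per frame, D channels.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def init_params(rng, d, hidden):
    # Hypothetical parameter layout, chosen only for this illustration.
    shapes = {"w_gate": (d, d), "w_in": (d, d),
              "w_q": (d, d), "w_k": (d, d), "w_v": (d, d),
              "w_1": (d, hidden), "w_2": (hidden, d)}
    return {k: rng.normal(0.0, 1.0 / np.sqrt(d), size=s) for k, s in shapes.items()}

def time_mix(p, x):
    # Gated linear recurrence over time, run independently for each spatial token.
    # Assumption: a simplified gate, h_t = a_t * h_{t-1} + (1 - a_t) * u_t.
    a = sigmoid(x @ p["w_gate"])   # (T, N, D), values in (0, 1)
    u = x @ p["w_in"]              # (T, N, D)
    h = np.zeros_like(x[0])        # recurrent state, one vector per spatial token
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = a[t] * h + (1.0 - a[t]) * u[t]
        out[t] = h                 # depends only on frames 0..t
    return out

def space_mix(p, x):
    # Single-head self-attention across the N tokens of each frame separately.
    q, k, v = x @ p["w_q"], x @ p["w_k"], x @ p["w_v"]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(x.shape[-1])  # (T, N, N)
    return softmax(scores) @ v

def channel_mix(p, x):
    # Two-layer MLP with ReLU, applied to every token independently.
    return np.maximum(x @ p["w_1"], 0.0) @ p["w_2"]

def block(p, x):
    # Residual connections around each mixing step; normalisation omitted.
    x = x + time_mix(p, x)
    x = x + space_mix(p, x)
    x = x + channel_mix(p, x)
    return x

rng = np.random.default_rng(0)
params = init_params(rng, d=8, hidden=16)
video = rng.normal(size=(6, 4, 8))   # 6 frames, 4 tokens per frame, 8 channels
out = block(params, video)

# Causality check: perturbing frames 4 and 5 must leave outputs for frames 0..3 unchanged.
edited = video.copy()
edited[4:] += 1.0
out_edited = block(params, edited)
assert np.allclose(out[:4], out_edited[:4])
```

Because only the recurrence looks across frames, and it only looks backwards, the block as sketched is causal in time while attention stays confined to a single frame. At inference the recurrent state `h` is all that needs to be carried from one frame to the next, which is the property that makes frame-by-frame streaming possible in this kind of design.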

Code & Models

Repositories

google-deepmind/trecvit (JAX, official)

Taxonomy

Topics: Advanced Vision and Imaging · Video Analysis and Summarization · Video Coding and Compression Technologies

Methods: Softmax · Attention Is All You Need