TRecViT: A Recurrent Video Transformer
Viorica Pătrăucean, Xu Owen He, Joseph Heyward, Chuhan Zhang, Mehdi S. M. Sajjadi, George-Cristian Muraru, Artem Zholus, Mahdi Karami, Ross Goroshin, Yutian Chen, Simon Osindero, João Carreira, Razvan Pascanu

TL;DR
TRecViT introduces a causal video transformer architecture that combines recurrent units, self-attention, and MLPs, achieving high performance with significantly reduced computational resources and real-time inference capability.
Contribution
It is the first causal video model in the state-space models family, combining recurrent and attention mechanisms for efficient and effective video modeling.
Findings
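Outperforms or matches the non-causal ViViT-L on large-scale video datasets (SSv2, Kinetics400) with fewer parameters, a smaller memory footprint, and a lower FLOP count.
Achieves real-time inference at about 300 frames per second.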
State-of-the-art results on SSv2 among causal models.
Abstract
We propose a novel block for causal video modelling. It relies on a time-space-channel factorisation with dedicated blocks for each dimension: gated linear recurrent units (LRUs) perform information mixing over time, self-attention layers perform mixing over space, and MLPs over channels. The resulting architecture, TRecViT, is causal and shows strong performance on sparse and dense tasks, trained in supervised or self-supervised regimes, and is the first causal video model in the state-space models family. Notably, our model outperforms or is on par with the popular (non-causal) ViViT-L model on large-scale video datasets (SSv2, Kinetics400), while having fewer parameters, a smaller memory footprint, and a lower FLOP count than the full self-attention ViViT, with an inference throughput of about 300 frames per second, running comfortably in…
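The time-space-channel factorisation described in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the fixed scalar `gate`, the identity query/key/value projections, and the tiny hand-written MLP weights are all assumptions made for illustration, and residual connections and normalisation are omitted. It only shows how each dimension gets its own mixing step and why the result is causal in time.

```python
import math

def temporal_mix(video, gate=0.9):
    """Gated linear recurrence over time, one state per spatial token.
    video[t][n][d]; `gate` is a fixed constant here (an assumption)."""
    n_tokens, dim = len(video[0]), len(video[0][0])
    state = [[0.0] * dim for _ in range(n_tokens)]
    out = []
    for frame in video:
        # h_t = gate * h_{t-1} + (1 - gate) * x_t, using past and present only
        state = [[gate * state[n][d] + (1.0 - gate) * frame[n][d]
                  for d in range(dim)] for n in range(n_tokens)]
        out.append(state)
    return out

def spatial_mix(frame):
    """Single-head self-attention across the tokens of ONE frame.
    Identity Q/K/V projections are an illustrative simplification."""
    dim = len(frame[0])
    out = []
    for query in frame:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in frame]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * tok[d] for w, tok in zip(weights, frame))
                    for d in range(dim)])
    return out

def channel_mix(token, w_in, w_out):
    """Two-layer MLP with ReLU applied to a single token's channels."""
    hidden = [max(0.0, sum(x * w for x, w in zip(token, row))) for row in w_in]
    return [sum(h * w for h, w in zip(hidden, row)) for row in w_out]

def block(video, w_in, w_out, gate=0.9):
    """Time mixing, then space mixing, then channel mixing."""
    x = temporal_mix(video, gate)
    x = [spatial_mix(frame) for frame in x]
    return [[channel_mix(tok, w_in, w_out) for tok in frame] for frame in x]

# Toy input: 4 frames, 3 spatial tokens per frame, 2 channels per token.
video = [[[float(t + n + d) for d in range(2)] for n in range(3)]
         for t in range(4)]
w_in = [[0.5, -0.25], [0.1, 0.3], [-0.2, 0.4]]     # hypothetical weights
w_out = [[0.2, 0.1, -0.3], [0.05, -0.1, 0.25]]     # hypothetical weights
out = block(video, w_in, w_out)
```

Because the only step that crosses frames is the recurrence, the output for frame t depends on frames 0..t alone, and inference needs just one fixed-size state per spatial token rather than a growing buffer of past frames.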
Taxonomy
Topics: Advanced Vision and Imaging · Video Analysis and Summarization · Video Coding and Compression Technologies
Methods: Softmax · Attention Is All You Need
