TL;DR
This paper introduces LAPS, a novel video transformer method combining long-term leap attention and short-term periodic shift, reducing computational complexity while maintaining competitive accuracy on video classification tasks.
Contribution
The paper proposes LAPS, a new attention mechanism for video transformers that efficiently captures long-term and short-term temporal information with minimal additional computation.
Findings
Achieves competitive accuracy on the Kinetics-400 benchmark.
Adds only about 2.6% computation overhead when LAPS replaces vanilla 2D attention.
Maintains performance with zero extra parameters.
Abstract
Video transformer naturally incurs a heavier computation burden than a static vision transformer, as the former processes $T$ times longer sequence than the latter under the current attention of quadratic complexity $(T^2N^4)$. The existing works treat the temporal axis as a simple extension of spatial axes, focusing on shortening the spatio-temporal sequence by either generic pooling or local windowing without utilizing temporal redundancy. However, videos naturally contain redundant information between neighboring frames; thereby, we could potentially suppress attention on visually similar frames in a dilated manner. Based on this hypothesis, we propose the LAPS, a long-term "Leap Attention" (LA), short-term "Periodic Shift" (P-Shift) module for video transformers, with $(2TN^2)$ complexity. Specifically, the "LA" groups long-term…
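The cost argument in the abstract can be checked with a little arithmetic. The sketch below is illustrative and not taken from the paper's code. It assumes each frame is split into an N x N grid of tokens and that a clip has T frames; the function names and the values T=8, N=14 are hypothetical choices. It counts the query-key pairs scored by full self-attention when frames are processed independently and when the whole clip is treated as one sequence.

```python
def attention_pairs(num_tokens: int) -> int:
    """Query-key pairs scored by full self-attention over num_tokens tokens."""
    return num_tokens * num_tokens


def compare_costs(T: int, N: int) -> dict:
    """Compare per-frame attention with joint spatio-temporal attention.

    T: number of frames (assumed). N: tokens per frame side (assumed),
    so each frame contributes N * N tokens.
    """
    tokens_per_frame = N * N
    per_frame = T * attention_pairs(tokens_per_frame)  # T * N^4 pairs
    joint = attention_pairs(T * tokens_per_frame)      # T^2 * N^4 pairs
    return {"per_frame": per_frame, "joint": joint, "ratio": joint // per_frame}


costs = compare_costs(T=8, N=14)
print(costs)
```

Under these assumptions the joint sequence is T times longer than a single frame's sequence, and joint attention scores T times more pairs than running attention on each frame separately. That gap is what a video attention module tries to avoid paying.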
