TL;DR
This paper introduces the OG-ReG Transformer, a dual-path model inspired by human visual attention that captures both coarse and detailed spatiotemporal information for improved video understanding.
Contribution
It proposes a novel dual-path transformer architecture that mimics human glance and gaze behavior to better model motion and long-range dependencies in videos.
Findings
Achieves state-of-the-art results on Kinetics-400, Something-Something v2, and Diving-48 datasets.
Demonstrates the effectiveness of combining coarse and fine spatiotemporal features.
Shows competitive performance with efficient attention mechanisms.
Abstract
Recently, Transformers have made significant progress in various vision tasks. To balance computation and efficiency in video tasks, recent works heavily rely on factorized or window-based self-attention. However, these approaches split spatiotemporal correlations between regions of interest in videos, limiting the models' ability to capture motion and long-range dependencies. In this paper, we argue that, similar to the human visual system, the importance of temporal and spatial information varies across different time scales, and attention is allocated sparsely over time through glance and gaze behavior. Is equal consideration of time and space crucial for success in video tasks? Motivated by this understanding, we propose a dual-path network called the Overall Glance and Refined Gaze (OG-ReG) Transformer. The Glance path extracts coarse-grained overall spatiotemporal information, while…
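To make the trade-off mentioned in the abstract concrete, the sketch below counts query-key pairs for joint space-time self-attention versus a factorized (space-then-time) variant. It is only an illustration of why factorization is cheaper and what it gives up, not the OG-ReG design. The function names and the 8 x 14 x 14 token grid are assumptions chosen for the example and do not come from the paper.

```python
def joint_attention_pairs(frames: int, height: int, width: int) -> int:
    """Query-key pairs when every token attends to every token in the clip."""
    tokens = frames * height * width
    return tokens * tokens


def factorized_attention_pairs(frames: int, height: int, width: int) -> int:
    """Query-key pairs when each token attends over its own frame (spatial
    pass) and then over the same position in all frames (temporal pass)."""
    tokens = frames * height * width
    spatial_keys = height * width  # keys seen in the spatial pass
    temporal_keys = frames         # keys seen in the temporal pass
    return tokens * (spatial_keys + temporal_keys)


if __name__ == "__main__":
    # Assumed clip size: 8 frames, each split into a 14 x 14 patch grid.
    t, h, w = 8, 14, 14
    joint = joint_attention_pairs(t, h, w)
    factorized = factorized_attention_pairs(t, h, w)
    print(f"joint:      {joint} pairs")
    print(f"factorized: {factorized} pairs")
    print(f"ratio:      {joint / factorized:.2f}x")
```

The saving comes at a cost: in the factorized variant, a token at one position in one frame never attends directly to a token at a different position in a different frame within a single layer. This is one way to read the abstract's remark that such schemes split spatiotemporal correlations.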
