Insights from Visual Cognition: Understanding Human Action Dynamics with Overall Glance and Refined Gaze Transformer

Bohao Xing; Deng Li; Rong Gao; Xin Liu; Heikki Kälviäinen

arXiv:2604.06783·cs.CV·April 9, 2026

TL;DR

This paper introduces the OG-ReG Transformer, a dual-path model inspired by human visual attention that captures both coarse and fine-grained spatiotemporal information for video understanding.

Contribution

It proposes a novel dual-path transformer architecture that mimics human glance and gaze behavior to better model motion and long-range dependencies in videos.

Findings

01 Achieves state-of-the-art results on Kinetics-400, Something-Something v2, and Diving-48 datasets.

02 Demonstrates the effectiveness of combining coarse and fine spatiotemporal features.

03 Shows competitive performance with efficient attention mechanisms.

Abstract

Recently, Transformers have made significant progress in various vision tasks. To balance computation and efficiency in video tasks, recent works rely heavily on factorized or window-based self-attention. However, these approaches split spatiotemporal correlations between regions of interest in videos, limiting the models' ability to capture motion and long-range dependencies. In this paper, we argue that, similar to the human visual system, the importance of temporal and spatial information varies across different time scales, and attention is allocated sparsely over time through glance and gaze behavior. Is equal consideration of time and space crucial for success in video tasks? Motivated by this understanding, we propose a dual-path network called the Overall Glance and Refined Gaze (OG-ReG) Transformer. The Glance path extracts coarse-grained overall spatiotemporal information, while…
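
The abstract's remark that factorized self-attention splits spatiotemporal correlations can be made concrete with a small sketch. It is not the OG-ReG method and is not taken from the paper's repository. It only compares which token pairs can attend to each other directly under joint space-time attention versus a factorized scheme (attention within a frame, plus attention across frames at the same spatial position). The function name `attention_masks` and the grid sizes are assumptions chosen for illustration.

```python
import numpy as np

def attention_masks(num_frames, tokens_per_frame):
    """Boolean masks (True = query may attend to key) over a flattened video token grid.

    Hypothetical helper for illustration only; not part of any OG-ReG code.
    """
    idx = np.arange(num_frames * tokens_per_frame)
    frame = idx // tokens_per_frame   # frame index of each token
    pos = idx % tokens_per_frame      # spatial index of each token within its frame
    joint = np.ones((idx.size, idx.size), dtype=bool)   # every token sees every token
    spatial = frame[:, None] == frame[None, :]          # same frame only
    temporal = pos[:, None] == pos[None, :]             # same spatial position only
    return joint, spatial, temporal

# Assumed toy sizes: 4 frames, 3x3 = 9 tokens per frame.
joint, spatial, temporal = attention_masks(num_frames=4, tokens_per_frame=9)

# Pairs reachable in one factorized step (either the spatial or the temporal pass).
factorized = spatial | temporal
coverage = factorized.sum() / joint.sum()

print(f"directly connected pairs: {factorized.sum()} of {joint.sum()} ({coverage:.2%})")
# Token 0 is (frame 0, position 0); token 10 is (frame 1, position 1).
print("token 0 -> token 10 directly connected:", bool(factorized[0, 10]))
```

In this toy grid only a third of the token pairs are directly connected under the factorized scheme. A pair that differs in both frame and spatial position, such as a moving object that changes location between frames, can only interact indirectly through stacked layers, which is the limitation the abstract points to.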

Code & Models

Repositories

linuxsino/OG-ReG
github
