CAST: Cross-Attention in Space and Time for Video Action Recognition

Dongho Lee; Jongseo Lee; Jinwoo Choi

arXiv:2311.18825·cs.CV·September 4, 2024·5 cites

TL;DR

CAST introduces a novel cross-attention architecture for video action recognition that effectively balances spatial and temporal understanding using only RGB input, leading to improved performance across multiple benchmarks.

Contribution

The paper proposes a new two-stream architecture with a bottleneck cross-attention mechanism enabling better spatial-temporal information exchange in video recognition.

Findings

01. Shows consistently favorable performance on EPIC-KITCHENS-100, Something-Something-V2, and Kinetics-400, whereas the performance of existing methods fluctuates depending on the dataset.

02. Achieves balanced spatio-temporal understanding with only RGB input.

03. Demonstrates robustness across diverse video datasets.

Abstract

Recognizing human actions in videos requires spatial and temporal understanding. Most existing action recognition models lack a balanced spatio-temporal understanding of videos. In this work, we propose a novel two-stream architecture, called Cross-Attention in Space and Time (CAST), that achieves a balanced spatio-temporal understanding of videos using only RGB input. Our proposed bottleneck cross-attention mechanism enables the spatial and temporal expert models to exchange information and make synergistic predictions, leading to improved performance. We validate the proposed method with extensive experiments on public benchmarks with different characteristics: EPIC-KITCHENS-100, Something-Something-V2, and Kinetics-400. Our method consistently shows favorable performance across these datasets, while the performance of existing methods fluctuates depending on the dataset…
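
The abstract describes a bottleneck cross-attention mechanism through which the spatial and temporal experts exchange information. The following is a minimal NumPy sketch of that general idea, not the authors' implementation: one stream's tokens are projected into a narrow bottleneck, attend to the other stream's tokens there, and the result is projected back up and added residually. The function names, the single attention head, the token counts, the widths `d` and `r`, and the random weights are all illustrative assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def init_params(rng, d, r, scale=0.02):
    """Random weights for one direction of exchange (hypothetical layout)."""
    return {
        "down_target": rng.normal(0.0, scale, (d, r)),
        "down_source": rng.normal(0.0, scale, (d, r)),
        "w_q": rng.normal(0.0, scale, (r, r)),
        "w_k": rng.normal(0.0, scale, (r, r)),
        "w_v": rng.normal(0.0, scale, (r, r)),
        "up": rng.normal(0.0, scale, (r, d)),
    }


def bottleneck_cross_attention(target, source, params):
    """Update `target` tokens (n_t, d) using `source` tokens (n_s, d)."""
    # Project both streams down to the bottleneck width r < d.
    t_low = target @ params["down_target"]  # (n_t, r)
    s_low = source @ params["down_source"]  # (n_s, r)

    # Queries come from the target stream, keys and values from the source.
    q = t_low @ params["w_q"]
    k = s_low @ params["w_k"]
    v = s_low @ params["w_v"]

    # Scaled dot-product attention: each target token weights all source tokens.
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (n_t, n_s)
    mixed = attn @ v  # (n_t, r)

    # Project back to width d and add to the target stream residually.
    return target + mixed @ params["up"]


rng = np.random.default_rng(0)
d, r = 64, 16                           # assumed model and bottleneck widths
spatial = rng.normal(size=(8, d))       # assumed spatial-expert tokens
temporal = rng.normal(size=(12, d))     # assumed temporal-expert tokens

# One set of weights per direction, so information flows both ways.
t2s = init_params(rng, d, r)
s2t = init_params(rng, d, r)

spatial_out = bottleneck_cross_attention(spatial, temporal, t2s)
temporal_out = bottleneck_cross_attention(temporal, spatial, s2t)
print(spatial_out.shape, temporal_out.shape)
```

In this sketch, keeping `r` well below `d` keeps the added parameters and the attention cost small relative to each expert, and the residual form leaves each stream's own features intact when the up-projection contributes little.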

Code & Models

Repositories

khu-vll/cast
pytorch

Videos

CAST: Cross-Attention in Space and Time for Video Action Recognition · slideslive

Taxonomy

Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Multimodal Machine Learning Applications