MVAFormer: RGB-based Multi-View Spatio-Temporal Action Recognition with Transformer

Taiga Yamane; Satoshi Suzuki; Ryo Masumura; Shotaro Tora

arXiv:2511.02473·cs.CV·November 5, 2025

TL;DR

MVAFormer introduces a transformer-based multi-view cooperation module for spatio-temporal action recognition (STAR). The module operates on feature maps rather than embedding vectors, which preserves spatial information and improves recognition accuracy when each person's action is recognized sequentially from multiple camera views.

Contribution

The paper proposes a novel transformer-based cooperation module that uses feature maps for multi-view spatio-temporal action recognition. Earlier cooperation methods target the setting of recognizing a single action from an entire video, so they do not apply to the STAR setting.

Findings

01. Outperforms baselines by approximately 4.4 F-measure points.

02. Effectively models relationships between multiple views.

03. Preserves spatial information through feature map utilization.
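
The idea behind the second and third findings can be illustrated with a minimal sketch, assuming plain single-head self-attention over tokens taken from every spatial location of every view. This is not the MVAFormer architecture, whose details are not given on this page. The function names, the sizes, and the identity query/key/value projections are all hypothetical choices made for illustration.

```python
import math
import random


def flatten_views(feature_maps):
    """Turn V feature maps of shape H x W x C into one list of V*H*W tokens."""
    return [cell for view in feature_maps for row in view for cell in row]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys, and values are the tokens themselves
    (identity projections); a real module would learn these projections.
    """
    scale = math.sqrt(len(tokens[0]))
    outputs, weights = [], []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in tokens]
        row = softmax(scores)
        weights.append(row)
        # Weighted sum over all tokens, including tokens from other views.
        outputs.append([
            sum(w * tok[c] for w, tok in zip(row, tokens))
            for c in range(len(query))
        ])
    return outputs, weights


# Hypothetical sizes: 3 views, 2 x 2 feature maps, 4 channels.
NUM_VIEWS, HEIGHT, WIDTH, CHANNELS = 3, 2, 2, 4
rng = random.Random(0)
feature_maps = [
    [[[rng.uniform(-1, 1) for _ in range(CHANNELS)] for _ in range(WIDTH)]
     for _ in range(HEIGHT)]
    for _ in range(NUM_VIEWS)
]

tokens = flatten_views(feature_maps)
fused, attn = self_attention(tokens)

# One output token per spatial location per view: the layout is kept.
print(len(fused), "tokens of size", len(fused[0]))
```

Because every spatial location stays a separate token, each location in one view can attend to each location in the other views. Pooling each view into a single embedding vector before fusion would discard that per-location structure.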

Abstract

Multi-view action recognition aims to recognize human actions using multiple camera views and deals with occlusion caused by obstacles or crowds. In this task, cooperation among views, which generates a joint representation by combining multiple views, is vital. Previous studies have explored promising cooperation methods for improving performance. However, since their methods focus only on the task setting of recognizing a single action from an entire video, they are not applicable to the recently popular spatio-temporal action recognition (STAR) setting, in which each person's action is recognized sequentially. To address this problem, this paper proposes a multi-view action recognition method for the STAR setting, called MVAFormer. In MVAFormer, we introduce a novel transformer-based cooperation module among views. In contrast to previous studies, which utilize embedding vectors with…

Taxonomy

Topics: Human Pose and Action Recognition · Hand Gesture Recognition Systems · Robot Manipulation and Learning