An Effective End-to-End Solution for Multimodal Action Recognition

Songping Wang, Xiantao Hu, Yueming Lyu, and Caifeng Shan

arXiv:2506.09345·cs.CV·June 12, 2025

Open Access

TL;DR

This paper presents a comprehensive end-to-end multimodal action recognition system that leverages data augmentation, transfer learning, efficient spatial-temporal feature extraction, and prediction ensemble techniques to achieve state-of-the-art accuracy.

Contribution

The proposed solution introduces an integrated approach combining data enhancement, transfer learning, efficient multimodal feature extraction, and ensemble methods for improved action recognition performance.

Findings

01. Achieved 99% Top-1 accuracy on the leaderboard.

02. Utilized transfer learning with RGB datasets for better model adaptation.

03. Combined multiple prediction techniques to enhance accuracy.

Abstract

Multimodal tasks have recently advanced the field of action recognition by supplying rich, complementary information. However, tri-modal data are scarce, which makes research on tri-modal action recognition challenging. To address this, we propose a comprehensive multimodal action recognition solution that makes effective use of multimodal information. First, optimized data augmentation techniques transform and expand the existing data to enlarge the training set. In addition, more RGB datasets are used to pre-train the backbone network, which is then adapted to the new task through transfer learning. Second, 2D CNNs extract multimodal spatial features and are combined with the Temporal Shift Module (TSM) to achieve multimodal spatial-temporal feature extraction comparable to 3D CNNs and improve the computational…
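The temporal shift idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `temporal_shift`, the `fold_div` parameter, and the plain-list data layout are assumptions made for illustration. The sketch shows the general TSM mechanism, in which a fraction of the feature channels is moved one frame backward or forward in time so that a 2D network sees information from neighboring frames at no extra multiply-add cost.

```python
def temporal_shift(frames, fold_div=8):
    """Shift a fraction of channels along the time axis (TSM-style sketch).

    frames: list of T per-frame feature vectors, each a list of C numbers.
    fold_div: hypothetical parameter; C // fold_div channels are shifted
              in each temporal direction, and the rest are left in place.
    """
    num_frames = len(frames)
    if num_frames == 0:
        return []
    num_channels = len(frames[0])
    fold = num_channels // fold_div

    # Positions with no temporal neighbor keep the zero padding.
    shifted = [[0.0] * num_channels for _ in range(num_frames)]
    for t in range(num_frames):
        for c in range(num_channels):
            if c < fold:
                src = t + 1   # first group: take the value from the next frame
            elif c < 2 * fold:
                src = t - 1   # second group: take the value from the previous frame
            else:
                src = t       # remaining channels are not shifted
            if 0 <= src < num_frames:
                shifted[t][c] = frames[src][c]
    return shifted


# Three frames with four channels each; fold_div=4 shifts one channel each way.
example = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
result = temporal_shift(example, fold_div=4)
```

In a real network the same shift would be applied to convolutional feature maps before a 2D convolution; the nested-list version here only makes the indexing explicit.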

Taxonomy

Topics: Human Pose and Action Recognition · Hand Gesture Recognition Systems · Context-Aware Activity Recognition Systems

Methods: Stochastic Weight Averaging