Exploiting Spatial-Temporal Modelling and Multi-Modal Fusion for Human Action Recognition

Dongliang He; Fu Li; Qijie Zhao; Xiang Long; Yi Fu; Shilei Wen

arXiv:1806.10319·cs.CV·June 28, 2018·21 cites

Open Access

TL;DR

This paper introduces a novel spatial-temporal network (StNet) and an improved multi-modal fusion strategy (iTXN) for human action recognition, achieving top performance on the Kinetics-600 challenge.

Contribution

It proposes StNet for enhanced joint spatial-temporal video modeling and an improved fusion method (iTXN) for integrating multiple modalities, advancing state-of-the-art results.

Findings

01

StNet RGB model achieves 78.99% top-1 accuracy.

02

Multi-modal iTXN reaches 82.35% accuracy.

03

Ensemble method achieves 85.0% top-1 accuracy, ranking first.

Abstract

In this report, we describe in detail our approach to the ActivityNet 2018 Kinetics-600 challenge. Although existing state-of-the-art methods for this task model spatial-temporal information with either end-to-end frameworks such as I3D or two-stage frameworks (i.e., CNN+RNN), video modelling is far from being well solved. In this challenge, we propose the spatial-temporal network (StNet) for better joint spatial-temporal modelling and more comprehensive video understanding. In addition, since video sources contain multi-modal information, we integrate both early-fusion and late-fusion strategies for multi-modal information via our proposed improved temporal Xception network (iTXN) for video understanding. Our StNet RGB single model achieves 78.99% top-1 precision on the Kinetics-600 validation set and that of our improved…
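
The abstract names early fusion and late fusion without saying how iTXN combines them. The following is a generic sketch of the two ideas only, not the authors' method: early fusion concatenates per-modality features before a shared classifier, while late fusion averages per-modality class probabilities. All function names, feature vectors, and logits here are hypothetical toy values.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_fusion(features_by_modality):
    """Early fusion (generic): concatenate per-modality feature vectors
    into one joint vector that a single classifier would consume."""
    fused = []
    for feats in features_by_modality:
        fused.extend(feats)
    return fused


def late_fusion(logits_by_modality, weights=None):
    """Late fusion (generic): weighted average of per-modality class
    probabilities. Uniform weights are an assumption of this sketch."""
    n = len(logits_by_modality)
    if weights is None:
        weights = [1.0 / n] * n
    probs = [softmax(logits) for logits in logits_by_modality]
    num_classes = len(probs[0])
    return [
        sum(w * p[c] for w, p in zip(weights, probs))
        for c in range(num_classes)
    ]


# Hypothetical toy inputs, not taken from the paper.
rgb_feats, flow_feats, audio_feats = [0.2, 0.7], [0.5, 0.1], [0.9]
joint = early_fusion([rgb_feats, flow_feats, audio_feats])

rgb_logits = [2.0, 0.5, 0.1]
flow_logits = [0.3, 1.0, 0.2]
fused_probs = late_fusion([rgb_logits, flow_logits])
predicted_class = max(range(len(fused_probs)), key=fused_probs.__getitem__)
```

In this toy case the RGB stream is confident in class 0 and the flow stream mildly prefers class 1, so the averaged probabilities still favour class 0. A real system would learn the classifier on the concatenated features and might tune the late-fusion weights on a validation set.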

Taxonomy

Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Gait Recognition and Analysis

Methods: Average Pooling · Depthwise Convolution · Pointwise Convolution · Global Average Pooling · Depthwise Separable Convolution · Residual Connection · Dense Connections · 1x1 Convolution · Max Pooling