Spatio-Temporal Fusion Networks for Action Recognition

Sangwoo Cho; Hassan Foroosh

arXiv:1906.06822 · cs.CV · June 18, 2019

TL;DR

This paper introduces a spatio-temporal fusion network (STFN) that captures the temporal dynamics of appearance and motion information across entire videos and aggregates them into a video-level representation for action recognition.

Contribution

The paper proposes a new STFN architecture with Residual Inception blocks for enhanced temporal feature extraction and demonstrates its applicability across different networks for superior video classification.

Findings

01. Achieved state-of-the-art results on UCF101 and HMDB51 datasets.

02. Effectively captures local and global temporal dynamics.

03. Enhances video-level representation for action recognition.

Abstract

Video-based CNN work has focused on effective ways to fuse appearance and motion networks, but it typically fails to exploit temporal information across video frames. In this work, we present a novel spatio-temporal fusion network (STFN) that integrates the temporal dynamics of appearance and motion information from entire videos. The captured temporal dynamics are then aggregated into a better video-level representation and learned via end-to-end training. The spatio-temporal fusion network consists of two sets of Residual Inception blocks that extract temporal dynamics and a fusion connection for appearance and motion features. The benefits of STFN are: (a) it captures local and global temporal dynamics of complementary data to learn video-wide information; and (b) it is applicable to any network for video classification to boost performance. We explore a variety of design…
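The data flow described in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation. The abstract does not specify the block internals, so everything concrete here is an assumption made for illustration: the function names are hypothetical, fixed box filters stand in for learned temporal convolutions, the kernel sizes are (1, 3, 5), fusion is an element-wise sum, the first set of blocks is placed on each stream and the second set after fusion, and temporal average pooling produces the video-level vector.

```python
def temporal_box_filter(frames, kernel):
    """Average each frame's features with its temporal neighbours (zero-padded).

    Stand-in for a learned 1D temporal convolution (assumption).
    """
    t_len, dim = len(frames), len(frames[0])
    half = kernel // 2
    out = []
    for t in range(t_len):
        acc = [0.0] * dim
        for s in range(t - half, t + half + 1):
            if 0 <= s < t_len:  # frames outside the video count as zeros
                for d in range(dim):
                    acc[d] += frames[s][d]
        out.append([v / kernel for v in acc])
    return out


def residual_inception_block(frames, kernels=(1, 3, 5)):
    """Parallel temporal filters at several scales, averaged, plus a skip connection.

    Small kernels see local dynamics, larger kernels see longer-range dynamics.
    The kernel sizes are assumed, not taken from the paper.
    """
    branches = [temporal_box_filter(frames, k) for k in kernels]
    out = []
    for t, frame in enumerate(frames):
        row = []
        for d, x in enumerate(frame):
            mixed = sum(b[t][d] for b in branches) / len(branches)
            row.append(x + mixed)  # residual connection
        out.append(row)
    return out


def fuse(appearance, motion):
    """Fusion connection, here an element-wise sum (assumed operator)."""
    return [[a + m for a, m in zip(fa, fm)] for fa, fm in zip(appearance, motion)]


def video_level_representation(appearance, motion):
    """Map two T x D per-frame feature sequences to one D-dimensional vector."""
    app = residual_inception_block(appearance)   # first set, appearance stream
    mot = residual_inception_block(motion)       # first set, motion stream
    fused = residual_inception_block(fuse(app, mot))  # second set, after fusion
    t_len = len(fused)
    return [sum(col) / t_len for col in zip(*fused)]  # temporal average pooling


if __name__ == "__main__":
    # Toy input: 4 frames with 3-dimensional features per stream.
    appearance = [[float(t), 1.0, 0.0] for t in range(4)]
    motion = [[0.0, 1.0, float(t)] for t in range(4)]
    print(video_level_representation(appearance, motion))
```

In a trained model, the box filters would be replaced by learned convolution weights and the per-frame features would come from the appearance and motion networks, with the whole pipeline optimized end to end as the abstract states.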

Figures: 17
