Spatio-Temporal Fusion Networks for Action Recognition
Sangwoo Cho, Hassan Foroosh

TL;DR
This paper introduces a novel spatio-temporal fusion network (STFN) that effectively captures and integrates temporal dynamics from entire videos, significantly improving action recognition performance.
Contribution
The paper proposes a new STFN architecture with Residual Inception blocks for enhanced temporal feature extraction and demonstrates its applicability across different networks for superior video classification.
Findings
Achieved state-of-the-art results on UCF101 and HMDB51 datasets.
Effectively captures local and global temporal dynamics.
Enhances video-level representation for action recognition.
Abstract
Video-based CNN approaches have focused on effective ways to fuse appearance and motion networks, but they typically do not exploit temporal information across video frames. In this work, we present a novel spatio-temporal fusion network (STFN) that integrates temporal dynamics of appearance and motion information from entire videos. The captured temporal dynamics are then aggregated into a better video-level representation and learned via end-to-end training. The spatio-temporal fusion network consists of two sets of Residual Inception blocks that extract temporal dynamics and a fusion connection for appearance and motion features. The benefits of STFN are: (a) it captures local and global temporal dynamics of complementary data to learn video-wide information; and (b) it is applicable to any network for video classification to boost performance. We explore a variety of design…
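To make the data flow in the abstract concrete, here is a minimal pure-Python sketch of one plausible arrangement: per-frame appearance and motion features each pass through a residual, multi-branch temporal block, are fused, pass through a second block, and are averaged over time into a video-level vector. This is not the authors' implementation. The function names, the fixed (unlearned) kernels, the averaging of branches, the element-wise sum used for fusion, and the placement of the blocks around the fusion step are all assumptions made for illustration.

```python
def temporal_conv(seq, kernel):
    """1D convolution over time with zero padding, applied identically to every channel.

    seq: list of T per-frame feature vectors (each a list of C floats).
    kernel: odd-length list of weights (fixed here; learned in a real network).
    """
    T, C = len(seq), len(seq[0])
    half = len(kernel) // 2
    out = []
    for t in range(T):
        vec = [0.0] * C
        for k, w in enumerate(kernel):
            src = t + k - half
            if 0 <= src < T:  # frames outside the video contribute zero
                for c in range(C):
                    vec[c] += w * seq[src][c]
        out.append(vec)
    return out


def residual_inception_block(seq, kernels):
    """Run parallel temporal branches with different kernel sizes, average them
    (assumption: a real block would combine branches with learned layers),
    and add the input back as a residual connection."""
    branches = [temporal_conv(seq, k) for k in kernels]
    T, C = len(seq), len(seq[0])
    return [
        [seq[t][c] + sum(b[t][c] for b in branches) / len(branches) for c in range(C)]
        for t in range(T)
    ]


def fuse(appearance, motion):
    """Fusion connection, sketched as an element-wise sum (assumption)."""
    return [[a + m for a, m in zip(av, mv)] for av, mv in zip(appearance, motion)]


def video_level_representation(seq):
    """Aggregate per-frame features into one vector by averaging over time."""
    T = len(seq)
    return [sum(col) / T for col in zip(*seq)]


def stfn_sketch(appearance, motion, kernels):
    """Hypothetical end-to-end flow: block per stream, fuse, block, temporal pooling."""
    a = residual_inception_block(appearance, kernels)
    m = residual_inception_block(motion, kernels)
    fused = residual_inception_block(fuse(a, m), kernels)
    return video_level_representation(fused)


# Usage: 4 frames, 3 channels per stream; short and long kernels stand in for
# local and more global temporal context.
appearance = [[0.1 * (t + c) for c in range(3)] for t in range(4)]
motion = [[0.05 * (t - c) for c in range(3)] for t in range(4)]
kernels = [[1.0], [0.25, 0.5, 0.25], [0.2, 0.2, 0.2, 0.2, 0.2]]
video_vector = stfn_sketch(appearance, motion, kernels)
```

The branches with longer kernels mix information from more distant frames, which is the sense in which a block of this shape can cover both local and more global temporal context; in a trained network the kernel weights and the branch combination would be learned end to end.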
