Convolutional Two-Stream Network Fusion for Video Action Recognition

Christoph Feichtenhofer; Axel Pinz; Andrew Zisserman

arXiv:1604.06573 · cs.CV · September 27, 2016 · 372 cites

TL;DR

This paper introduces a convolutional two-stream network architecture that fuses appearance and motion information at a convolutional layer, and additionally at the class prediction layer, for video action recognition, achieving state-of-the-art results.

Contribution

The paper presents a ConvNet fusion architecture that combines the spatial (appearance) and temporal (motion) streams at a convolutional layer rather than only at the softmax layer, which preserves accuracy while substantially reducing the number of parameters.

Findings

01

Fusing the spatial and temporal networks at a convolutional layer, rather than at the softmax layer, saves a substantial number of parameters without loss of performance.

02

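Fusing spatially at the last convolutional layer works better than fusing at earlier layers, and additionally fusing at the class prediction layer can boost accuracy.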

03

Pooling abstract convolutional features over spatiotemporal neighbourhoods further boosts performance.
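
To make the first two findings concrete, the following is a minimal NumPy sketch of one way to fuse two feature maps at a convolutional layer: stack the channels of both streams, then mix them with a 1x1 convolution. It is an illustration under stated assumptions, not the authors' implementation. The function name `conv_fusion`, the tensor shapes, and the averaging initialization of the weights are all hypothetical choices made for this example.

```python
import numpy as np


def conv_fusion(spatial, temporal, weights, bias):
    """Fuse two (C, H, W) feature maps by channel stacking and a 1x1 convolution.

    weights has shape (C_out, 2 * C) and bias has shape (C_out,).
    Hypothetical helper for illustration only.
    """
    if spatial.shape != temporal.shape:
        raise ValueError("both streams must produce feature maps of the same shape")
    # Stack the two streams along the channel axis: (2C, H, W).
    stacked = np.concatenate([spatial, temporal], axis=0)
    # A 1x1 convolution is a per-pixel linear map over channels.
    fused = np.einsum("oc,chw->ohw", weights, stacked)
    return fused + bias[:, None, None]


rng = np.random.default_rng(0)
C, H, W = 4, 3, 3  # assumed toy sizes
spatial = rng.standard_normal((C, H, W))   # appearance-stream features
temporal = rng.standard_normal((C, H, W))  # motion-stream features

# Assumed initialization: each output channel averages the matching
# channels of the two streams. In practice the weights would be learned.
weights = np.concatenate([0.5 * np.eye(C), 0.5 * np.eye(C)], axis=1)
bias = np.zeros(C)

fused = conv_fusion(spatial, temporal, weights, bias)
```

After this point a single set of layers processes `fused`, which is where a parameter saving of the kind described above could come from: the layers that follow the fusion point are no longer duplicated across two towers.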

Abstract

Recent applications of Convolutional Neural Networks (ConvNets) for human action recognition in videos have proposed different solutions for incorporating the appearance and motion information. We study a number of ways of fusing ConvNet towers both spatially and temporally in order to best take advantage of this spatio-temporal information. We make the following findings: (i) that rather than fusing at the softmax layer, a spatial and temporal network can be fused at a convolution layer without loss of performance, but with a substantial saving in parameters; (ii) that it is better to fuse such networks spatially at the last convolutional layer than earlier, and that additionally fusing at the class prediction layer can boost accuracy; finally (iii) that pooling of abstract convolutional features over spatiotemporal neighbourhoods further boosts performance. Based on these studies we…
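
Finding (iii) refers to pooling convolutional features over spatiotemporal neighbourhoods. The sketch below shows one simple form of this, non-overlapping max pooling over time, height, and width, in NumPy. The function name `max_pool_3d`, the `(C, T, H, W)` layout, and the pooling sizes are assumptions for illustration; the abstract does not specify the pooling operator or window sizes the authors use.

```python
import numpy as np


def max_pool_3d(features, pool_t, pool_h, pool_w):
    """Non-overlapping max pooling over (T, H, W) for features of shape (C, T, H, W).

    Hypothetical helper for illustration only.
    """
    c, t, h, w = features.shape
    if t % pool_t or h % pool_h or w % pool_w:
        raise ValueError("each pooled dimension must be divisible by its pool size")
    # Split every pooled axis into (number of blocks, block size).
    blocks = features.reshape(
        c, t // pool_t, pool_t, h // pool_h, pool_h, w // pool_w, pool_w
    )
    # Take the maximum inside each spatiotemporal block.
    return blocks.max(axis=(2, 4, 6))


# Toy input: 2 channels, 4 frames, 4x4 spatial grid (assumed sizes).
features = np.arange(2 * 4 * 4 * 4, dtype=float).reshape(2, 4, 4, 4)
pooled = max_pool_3d(features, pool_t=2, pool_h=2, pool_w=2)
```

Each output value summarizes a 2x2x2 block spanning two frames and a 2x2 spatial window, so the pooled features aggregate evidence across neighbouring frames as well as neighbouring positions.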

Taxonomy

Topics: Human Pose and Action Recognition · Video Surveillance and Tracking Methods · Anomaly Detection Techniques and Applications

Methods: Softmax · Convolution