# Second-order Temporal Pooling for Action Recognition

**Authors:** Anoop Cherian, Stephen Gould

arXiv: 1704.06925 · 2018-08-08

## TL;DR

This paper introduces a novel second-order temporal pooling method for action recognition in videos, capturing richer feature interactions and improving accuracy over traditional first-order methods.

## Contribution

The paper proposes a new end-to-end learnable temporal correlation pooling scheme that captures second-order feature interactions for enhanced action recognition.

## Key findings

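- Reports state-of-the-art accuracy when the higher-order pooling schemes are combined with hand-crafted features, as is standard practice.
- Extends the second-order scheme to higher orders by computing correlations after embedding the CNN features in a reproducing kernel Hilbert space.
- Evaluates on the HMDB-51 and UCF-101 benchmarks, the fine-grained MPII Cooking Activities and JHMDB datasets, and Kinetics-600.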

## Abstract

Deep learning models for video-based action recognition usually generate features for short clips (consisting of a few frames); such clip-level features are aggregated to video-level representations by computing statistics on these features. Typically, zeroth-order (max) or first-order (average) statistics are used. In this paper, we explore the benefits of using second-order statistics. Specifically, we propose a novel end-to-end learnable feature aggregation scheme, dubbed temporal correlation pooling, that generates an action descriptor for a video sequence by capturing the similarities between the temporal evolution of clip-level CNN features computed across the video. Such a descriptor, while being computationally cheap, also naturally encodes the co-activations of multiple CNN features, thereby providing a richer characterization of actions than its first-order counterparts. We also propose higher-order extensions of this scheme by computing correlations after embedding the CNN features in a reproducing kernel Hilbert space. We provide experiments on benchmark datasets such as HMDB-51 and UCF-101, fine-grained datasets such as MPII Cooking Activities and JHMDB, as well as the recent Kinetics-600. Our results demonstrate the advantages of higher-order pooling schemes that, when combined with hand-crafted features (as is standard practice), achieve state-of-the-art accuracy.
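
To make the contrast between pooling orders concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names, the unit-norm scaling of each feature trajectory, the RBF kernel used as a stand-in for the reproducing kernel Hilbert space embedding, and the `sigma` bandwidth are all choices made here. The input `X` is assumed to hold one row of clip-level CNN features per clip, so each column is the temporal evolution of one feature.

```python
import numpy as np


def first_order_pool(X):
    """Max and average pooling over time; each returns a d-vector."""
    X = np.asarray(X, dtype=np.float64)
    return X.max(axis=0), X.mean(axis=0)


def temporal_correlation_pool(X, eps=1e-8):
    """Second-order pooling: similarity between feature trajectories.

    X has shape (T, d): T clips, d features per clip. Each column is one
    feature's trajectory over time. Scaling columns to unit norm is an
    assumption of this sketch, which makes each entry a cosine similarity.
    """
    X = np.asarray(X, dtype=np.float64)
    norms = np.linalg.norm(X, axis=0, keepdims=True)  # shape (1, d)
    Z = X / (norms + eps)
    return Z.T @ Z  # shape (d, d), symmetric positive semi-definite


def rbf_trajectory_pool(X, sigma=1.0):
    """Hypothetical kernelized variant: RBF kernel between trajectories."""
    X = np.asarray(X, dtype=np.float64)
    sq = np.sum(X ** 2, axis=0)  # squared norm of each trajectory
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)
    d2 = np.maximum(d2, 0.0)  # clip tiny negatives from round-off
    return np.exp(-d2 / (2.0 * sigma ** 2))


def descriptor(C):
    """Flatten the upper triangle of a symmetric matrix into a vector."""
    rows, cols = np.triu_indices(C.shape[0])
    return C[rows, cols]


# Toy input: 20 clips with 5 features each (random stand-in for CNN outputs).
X = np.random.default_rng(0).random((20, 5))
max_pooled, avg_pooled = first_order_pool(X)
C = temporal_correlation_pool(X)
video_descriptor = descriptor(C)
```

With `d` features per clip, the first-order descriptors have `d` entries, while the second-order descriptor has `d * (d + 1) / 2` entries (the upper triangle of the symmetric `d x d` matrix), which is where the pairwise co-activation information lives. The descriptor size does not depend on the number of clips `T`, so videos of different lengths map to vectors of the same size.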

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1704.06925/full.md

## Figures

41 figures with captions in the complete paper: https://tomesphere.com/paper/1704.06925/full.md

## References

95 references — full list in the complete paper: https://tomesphere.com/paper/1704.06925/full.md

---
Source: https://tomesphere.com/paper/1704.06925