# Deep Local Video Feature for Action Recognition

**Authors:** Zhenzhong Lan, Yi Zhu, Alexander G. Hauptmann

arXiv: 1701.07368 · 2017-01-31

## TL;DR

This paper represents a whole video for action recognition by treating CNNs trained on sampled frames as local feature extractors, aggregating the resulting local features into a global feature, and training a second mapping from global features to video-level labels.

## Contribution

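GPU memory limits prevent feeding an entire video into a CNN/RNN for end-to-end learning, so networks are usually trained on sampled frames with video-level labels, even though a given sample may not contain the information the label indicates. The paper's contribution is to reuse these networks as local feature extractors and to learn the video-level mapping on the aggregated features instead.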

## Key findings

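- Simple max pooling over sparsely sampled local features leads to a significant performance improvement
- The improvement is reported on both the HMDB51 and UCF101 datasets
- Aggregating local features into a global feature, then mapping it to the video label, is an effective video representation for action recognition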

## Abstract

We investigate the problem of representing an entire video using CNN features for human action recognition. Currently, limited by GPU memory, we have not been able to feed a whole video into CNN/RNNs for end-to-end learning. A common practice is to use sampled frames as inputs and video labels as supervision. One major problem of this popular approach is that the local samples may not contain the information indicated by global labels. To deal with this problem, we propose to treat the deep networks trained on local inputs as local feature extractors. After extracting local features, we aggregate them into global features and train another mapping function on the same training data to map the global features into global labels. We study a set of problems regarding this new type of local features such as how to aggregate them into global features. Experimental results on HMDB51 and UCF101 datasets show that, for these new local features, a simple maximum pooling on the sparsely sampled features leads to significant performance improvement.
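The aggregation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `sample_indices` and `max_pool`, the evenly spaced sampling rule, and the toy two-dimensional feature values are all assumptions made for illustration. In practice the per-frame vectors would come from a network trained on local inputs, and the pooled vector would be passed to a separately trained mapping function.

```python
from typing import List, Sequence


def sample_indices(num_frames: int, num_samples: int) -> List[int]:
    """Pick sparse, evenly spaced frame indices (centre of each equal-length segment).

    Assumption: the summary does not state the sampling rule, so even spacing
    is used here purely as an example.
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    num_samples = min(num_samples, num_frames)
    segment = num_frames / num_samples
    return [int(segment * (i + 0.5)) for i in range(num_samples)]


def max_pool(local_features: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise maximum over local feature vectors, giving one global vector."""
    if not local_features:
        raise ValueError("need at least one local feature vector")
    return [max(column) for column in zip(*local_features)]


# Toy stand-in for per-frame CNN features (hypothetical values, 2 dimensions).
# Only the frame at index 4 responds strongly in dimension 1.
frame_features = [
    [0.2, 0.0],
    [0.1, 0.0],
    [0.3, 0.1],
    [0.2, 0.0],
    [0.1, 0.9],
    [0.2, 0.1],
]

indices = sample_indices(num_frames=len(frame_features), num_samples=3)
global_feature = max_pool([frame_features[i] for i in indices])
```

Max pooling keeps the strongest response in each dimension across the sampled frames, so a single informative sample is enough for that evidence to reach the global feature, even when the other samples do not contain it.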

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1701.07368/full.md

## References

21 references — full list in the complete paper: https://tomesphere.com/paper/1701.07368/full.md

---
Source: https://tomesphere.com/paper/1701.07368