TL;DR
This paper introduces trajectory-pooled deep-convolutional descriptors (TDD), a new video feature representation that combines deep learning and trajectory-based pooling to improve human action recognition accuracy.
Contribution
The paper proposes TDD, a novel video descriptor that integrates deep convolutional features with trajectory-constrained pooling and normalization methods, enhancing action recognition performance.
Findings
TDD outperforms previous hand-crafted and deep-learned features.
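It achieves state-of-the-art results on the HMDB51 and UCF101 datasets.
Two normalization methods, spatiotemporal normalization and channel normalization, are applied to the convolutional feature maps to make TDDs more robust.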
Abstract
Visual features are of vital importance for human action understanding in videos. This paper presents a new video representation, called trajectory-pooled deep-convolutional descriptor (TDD), which shares the merits of both hand-crafted features and deep-learned features. Specifically, we utilize deep architectures to learn discriminative convolutional feature maps, and conduct trajectory-constrained pooling to aggregate these convolutional features into effective descriptors. To enhance the robustness of TDDs, we design two normalization methods to transform convolutional feature maps, namely spatiotemporal normalization and channel normalization. The advantages of our features come from (i) TDDs are automatically learned and contain high discriminative capacity compared with those hand-crafted features; (ii) TDDs take account of the intrinsic characteristics of temporal dimension and…
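The pipeline the abstract describes (normalize the convolutional feature maps, then pool them along trajectories) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the abstract does not give formulas, so the max-based form of both normalizations, the sum pooling, the `(T, H, W, C)` array layout, the `scale` parameter that maps video coordinates to feature-map coordinates, and all function names are assumptions made here. Feature maps are assumed non-negative (for example, taken after a ReLU).

```python
import numpy as np


def spatiotemporal_normalize(fmap, eps=1e-8):
    """Divide each channel by its maximum over all frames and positions.

    fmap: array of shape (T, H, W, C), assumed non-negative.
    The max-based form is an assumption, not taken from the abstract.
    """
    max_per_channel = fmap.max(axis=(0, 1, 2), keepdims=True)
    return fmap / (max_per_channel + eps)


def channel_normalize(fmap, eps=1e-8):
    """Divide each spatiotemporal position by its maximum over channels.

    The max-based form is an assumption, not taken from the abstract.
    """
    max_per_position = fmap.max(axis=3, keepdims=True)
    return fmap / (max_per_position + eps)


def trajectory_pool(fmap, trajectory, scale=1.0):
    """Sum the feature vectors found at the points of one trajectory.

    trajectory: iterable of (t, x, y) points in video coordinates.
    scale: hypothetical ratio of feature-map size to video frame size.
    Returns a descriptor of length C.
    """
    _, height, width, channels = fmap.shape
    descriptor = np.zeros(channels, dtype=np.float64)
    for t, x, y in trajectory:
        # Map the point onto the feature map and clamp it to the borders.
        col = int(np.clip(round(x * scale), 0, width - 1))
        row = int(np.clip(round(y * scale), 0, height - 1))
        descriptor += fmap[int(t), row, col]
    return descriptor
```

Under these assumptions, one descriptor per trajectory would be obtained with a call such as `trajectory_pool(spatiotemporal_normalize(fmap), trajectory, scale)`, and the same pooling could be applied to the channel-normalized maps to give a second descriptor.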
