Long-Short Temporal Modeling for Efficient Action Recognition
Liyu Wu, Yuexian Zou, Can Zhang

TL;DR
This paper introduces MENet, a two-stream action recognition network that combines a Motion Enhancement module for short-term motion with a Video-level Aggregation module for long-term dependencies, and reports its effectiveness on standard benchmarks.
Contribution
The paper proposes MENet, a novel two-stream network that effectively models long and short-term temporal dependencies for action recognition.
Findings
MENet outperforms existing methods on UCF101 and HMDB51 benchmarks.
The Motion Enhancement module improves short-term motion representation.
The Video-level Aggregation module captures long-term dependencies efficiently.
Abstract
Efficient long-short temporal modeling is key to improving performance on the action recognition task. In this paper, we propose a new two-stream action recognition network, termed MENet, consisting of a Motion Enhancement (ME) module and a Video-level Aggregation (VLA) module to achieve long-short temporal modeling. Specifically, motion representations have proven effective in capturing short-term, high-frequency actions. However, current motion representations are calculated from adjacent frames, which can make them hard to interpret and can introduce useless information (noisy or blank). Thus, for short-term motions, we design an efficient ME module that enhances them by mingling the motion saliency among neighboring segments. For long-term aggregation, VLA is adopted at the top of the appearance branch to integrate the long-term dependencies across all segments.…
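The two ideas in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract is truncated and does not specify how ME or VLA are built, so every function name, the frame-difference motion representation, the three-segment neighborhood, the saliency-weighted average, and the mean pooling used for aggregation are assumptions chosen only to make the concepts concrete.

```python
import numpy as np

def motion_from_adjacent_frames(frame_pairs):
    """Assumed motion representation: difference of two adjacent frames.

    frame_pairs: array of shape (T, 2, H, W), one frame pair per segment.
    Returns an array of shape (T, H, W).
    """
    frame_pairs = np.asarray(frame_pairs, dtype=np.float64)
    return frame_pairs[:, 1] - frame_pairs[:, 0]

def mingle_neighbors(motion, eps=1e-8):
    """Hypothetical stand-in for motion enhancement.

    Each segment's motion map is replaced by a saliency-weighted average of
    the maps of itself and its immediate neighbors, so a blank segment
    borrows motion from more salient neighbors.
    """
    num_segments = motion.shape[0]
    # Saliency per segment: mean absolute motion (an assumed definition).
    saliency = np.abs(motion).reshape(num_segments, -1).mean(axis=1)
    enhanced = np.empty_like(motion)
    for t in range(num_segments):
        lo, hi = max(0, t - 1), min(num_segments, t + 2)
        weights = saliency[lo:hi] + eps      # eps avoids division by zero
        weights = weights / weights.sum()
        enhanced[t] = np.tensordot(weights, motion[lo:hi], axes=1)
    return enhanced

def aggregate_video_level(segment_features):
    """Hypothetical stand-in for video-level aggregation: mean over segments.

    segment_features: array of shape (T, D). Returns an array of shape (D,).
    """
    return np.asarray(segment_features, dtype=np.float64).mean(axis=0)

# Three segments of 2x2 frames; the middle segment has no motion at all.
pairs = np.zeros((3, 2, 2, 2))
pairs[0, 1] = 1.0
pairs[2, 1] = 3.0
enhanced = mingle_neighbors(motion_from_adjacent_frames(pairs))
video_feature = aggregate_video_level([[1.0, 2.0], [3.0, 4.0]])
```

In this toy input the middle segment starts with an all-zero motion map and ends up with a nonzero one drawn from its two neighbors, which is the kind of blank-segment case the abstract mentions. A learned module would replace both the fixed weighting and the mean pooling.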
Taxonomy
Topics: Human Pose and Action Recognition · Gait Recognition and Analysis · Diabetic Foot Ulcer Assessment and Management
