Attention Distillation for Learning Video Representations
Miao Liu, Xin Chen, Yun Zhang, Yin Li, James M. Rehg

TL;DR
This paper introduces an attention distillation method that transfers motion representations from flow networks to RGB networks, improving video recognition performance and action localization.
Contribution
We propose a novel attention distillation technique that enhances RGB video models with motion cues learned from flow networks.
Findings
The method yields significant performance improvements on major action benchmarks.
The learned attention maps leverage motion cues to locate actions in video frames.
The method consistently outperforms the baseline RGB network.
Abstract
We address the challenging problem of learning motion representations using deep models for video recognition. To this end, we make use of attention modules that learn to highlight regions in the video and aggregate features for recognition. Specifically, we propose to leverage output attention maps as a vehicle to transfer the learned representation from a motion (flow) network to an RGB network. We systematically study the design of attention modules, and develop a novel method for attention distillation. Our method is evaluated on major action benchmarks, and consistently improves the performance of the baseline RGB network by a significant margin. Moreover, we demonstrate that our attention maps can leverage motion cues in learning to identify the location of actions in video frames. We believe our method provides a step towards learning motion-aware representations in deep models.…
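To make the idea in the abstract concrete, the following is a minimal sketch of how attention maps could carry information from a flow network to an RGB network. It is an illustration under stated assumptions, not the authors' formulation: the abstract does not specify the attention module or the distillation loss, so the spatial softmax, the KL-divergence loss, and the names `spatial_softmax`, `attention_pool`, and `attention_distillation_loss` are all hypothetical choices, and random arrays stand in for real network outputs.

```python
import numpy as np

def spatial_softmax(scores):
    """Turn a (T, H, W) score volume into per-frame attention maps that sum to 1."""
    t = scores.shape[0]
    flat = scores.reshape(t, -1)
    flat = flat - flat.max(axis=1, keepdims=True)  # subtract max for numerical stability
    exp = np.exp(flat)
    return (exp / exp.sum(axis=1, keepdims=True)).reshape(scores.shape)

def attention_pool(features, attention):
    """Aggregate (T, C, H, W) features with (T, H, W) attention into (T, C) descriptors."""
    return np.einsum("tchw,thw->tc", features, attention)

def attention_distillation_loss(student_scores, teacher_scores, eps=1e-8):
    """Mean per-frame KL(teacher || student) between attention maps.

    Assumption: KL divergence is used here only as one plausible way to match
    attention maps; the source does not state the loss.
    """
    p = spatial_softmax(teacher_scores)  # flow (teacher) attention, treated as the target
    q = spatial_softmax(student_scores)  # RGB (student) attention
    kl = (p * (np.log(p + eps) - np.log(q + eps))).sum(axis=(1, 2))
    return float(kl.mean())

# Random stand-ins for network outputs: 4 frames, 16 channels, 7x7 spatial grid.
rng = np.random.default_rng(0)
rgb_scores = rng.normal(size=(4, 7, 7))
flow_scores = rng.normal(size=(4, 7, 7))
rgb_features = rng.normal(size=(4, 16, 7, 7))

pooled = attention_pool(rgb_features, spatial_softmax(rgb_scores))
loss = attention_distillation_loss(rgb_scores, flow_scores)
```

In a training setup along these lines, a term like `loss` would be added to the recognition loss of the RGB network so that its attention is pulled toward the attention of the flow network, while `pooled` would feed the classifier.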
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Anomaly Detection Techniques and Applications
