Deep video representation learning: a survey
Elham Ravanbakhsh, Yongqing Liang, J. Ramanujam, Xin Li

TL;DR
This survey reviews recent methods for deep video representation learning, compares spatial and temporal features, and discusses their strengths, limitations, and the remaining challenges in video analysis tasks.
Contribution
It provides a comprehensive classification and comparison of recent spatiotemporal feature learning methods for videos, highlighting their advantages and limitations.
Findings
Spatial and temporal features have distinct strengths.
Their effectiveness varies under changes in illumination, occlusion, view, and background.
Remaining challenges include robustness and generalization.
Abstract
This paper provides a review of representation learning for videos. We classify recent spatiotemporal feature learning methods for sequential visual data and compare their pros and cons for general video analysis. Building effective features for videos is a fundamental problem in computer vision tasks involving video analysis and understanding. Existing features can be generally categorized into spatial and temporal features. Their effectiveness under variations of illumination, occlusion, view, and background is discussed. Finally, we discuss the remaining challenges in existing deep video representation learning studies.
Taxonomy
Topics: Human Pose and Action Recognition · Advanced Vision and Imaging · Face and Expression Recognition
