Deep video representation learning: a survey

Elham Ravanbakhsh; Yongqing Liang; J. Ramanujam; Xin Li

arXiv:2405.06574 · cs.CV · May 13, 2024

TL;DR

This survey reviews recent methods for deep video representation learning, compares spatial and temporal features, and discusses their strengths, their limitations, and the challenges that remain in video analysis tasks.

Contribution

It provides a comprehensive classification and comparison of recent spatiotemporal feature learning methods for videos, highlighting their advantages and limitations.

Findings

01 Spatial and temporal features have distinct strengths.

02 Effectiveness varies under different visual variations.

03 Remaining challenges include robustness and generalization.

Abstract

This paper provides a review of representation learning for videos. We classify recent spatiotemporal feature learning methods for sequential visual data and compare their pros and cons for general video analysis. Building effective features for videos is a fundamental problem in computer vision tasks involving video analysis and understanding. Existing features can be generally categorized into spatial and temporal features. Their effectiveness under variations of illumination, occlusion, view, and background is discussed. Finally, we discuss the remaining challenges in existing deep video representation learning studies.

Taxonomy

Topics: Human Pose and Action Recognition · Advanced Vision and Imaging · Face and Expression Recognition