Video Understanding: From Geometry and Semantics to Unified Models
Zhaochong An, Zirui Li, Mingqiao Ye, Feng Qiao, Jiaang Li, Zongwei Wu, Vishal Thengane, Chengzu Li, Lei Li, Luc Van Gool, Guolei Sun, Serge Belongie

TL;DR
This survey reviews the evolution of video understanding, emphasizing the shift from isolated tasks to unified models that integrate geometry and semantics for more comprehensive and adaptable visual reasoning.
Contribution
The survey provides a structured overview of low-level geometry, high-level semantics, and unified models, highlighting recent trends and open challenges in developing robust, scalable video foundation models.
Findings
Unified modeling paradigms are increasingly adopted in video understanding.
Recent progress shows a shift towards adaptable, task-agnostic models.
Open challenges include robustness, scalability, and comprehensive reasoning.
Abstract
Video understanding aims to enable models to perceive, reason about, and interact with the dynamic visual world. In contrast to image understanding, video understanding inherently requires modeling temporal dynamics and evolving visual context, placing stronger demands on spatiotemporal reasoning and making it a foundational problem in computer vision. In this survey, we present a structured overview of video understanding by organizing the literature into three complementary perspectives: low-level video geometry understanding, high-level semantic understanding, and unified video understanding models. We further highlight a broader shift from isolated, task-specific pipelines toward unified modeling paradigms that can be adapted to diverse downstream objectives, enabling a more systematic view of recent progress. By consolidating these perspectives, this survey provides a coherent map…
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Vision and Imaging
