Video Understanding: From Geometry and Semantics to Unified Models

Zhaochong An; Zirui Li; Mingqiao Ye; Feng Qiao; Jiaang Li; Zongwei Wu; Vishal Thengane; Chengzu Li; Lei Li; Luc Van Gool; Guolei Sun; Serge Belongie

arXiv:2603.17840 · cs.CV · March 19, 2026

Open Access

TL;DR

This survey reviews the evolution of video understanding, emphasizing the shift from isolated tasks to unified models that integrate geometry and semantics for more comprehensive and adaptable visual reasoning.

Contribution

It provides a structured overview of low-level geometry, high-level semantics, and unified models, highlighting recent trends and open challenges in developing robust, scalable video foundation models.

Findings

01. Unified modeling paradigms are increasingly adopted in video understanding.

02. Recent progress shows a shift towards adaptable, task-agnostic models.

03. Open challenges include robustness, scalability, and comprehensive reasoning.

Abstract

Video understanding aims to enable models to perceive, reason about, and interact with the dynamic visual world. In contrast to image understanding, video understanding inherently requires modeling temporal dynamics and evolving visual context, placing stronger demands on spatiotemporal reasoning and making it a foundational problem in computer vision. In this survey, we present a structured overview of video understanding by organizing the literature into three complementary perspectives: low-level video geometry understanding, high-level semantic understanding, and unified video understanding models. We further highlight a broader shift from isolated, task-specific pipelines toward unified modeling paradigms that can be adapted to diverse downstream objectives, enabling a more systematic view of recent progress. By consolidating these perspectives, this survey provides a coherent map…

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Vision and Imaging