Video Understanding: Through A Temporal Lens
Thong Thanh Nguyen

TL;DR
This thesis advances video understanding through explicit temporal modeling: an automatic annotation framework, parameter-efficient fine-tuning, long-form video modeling, and relation-aware contrastive learning, supported by new benchmarks and an empirical study.
Contribution
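It makes five contributions: an automatic annotation framework, a parameter-efficient fine-tuning method based on recurrent adapters, long-form video modeling with State Space Layers (SSL), a contrastive framework that models relations between motions and video moments, and an empirical study of large vision-language models (LVLMs).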
Findings
Explicit temporal modeling improves video content reasoning.
Two new long-term benchmarks cover egocentric and feature-length videos.
The visual-language interface is identified as a bottleneck for temporal reasoning.
Abstract
This thesis explores the central question of how to leverage temporal relations among video elements to advance video understanding. Addressing the limitations of existing methods, the work presents a five-fold contribution: (1) an automatic annotation framework that utilizes large vision-language models and a noise-robust contrastive learning objective with a subtractive angular margin; (2) a parameter-efficient fine-tuning strategy using "recurrent adapters" to capture temporal dynamics in low-data regimes; (3) the integration of State Space Layers (SSL) for efficient long-form video modeling, supported by the introduction of two new long-term benchmarks for egocentric and feature-length content; (4) a novel contrastive learning framework designed to explicitly model fine-grained relations between motions and video moments; and (5) a comprehensive empirical study on Large…
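The abstract names a contrastive objective with a subtractive angular margin but does not give its formula. The following is a minimal sketch of one plausible reading, not the thesis's actual loss: an InfoNCE loss over matched video-text pairs in which the angle of each positive pair is reduced by a margin before taking the cosine, which relaxes the penalty on imperfectly aligned (noisy) pairs. The function name, the `margin` and `temperature` values, and the clamping at zero are all assumptions made for illustration.

```python
import numpy as np

def angular_margin_info_nce(video, text, margin=0.1, temperature=0.07):
    """Hypothetical InfoNCE loss with a subtractive angular margin.

    Row i of `video` and row i of `text` are treated as a matched pair.
    The positive-pair angle is reduced by `margin` (clamped at 0), so the
    positive logit is raised and noisy pairs are penalized less.
    """
    # L2-normalize so that dot products are cosine similarities.
    v = video / np.linalg.norm(video, axis=1, keepdims=True)
    t = text / np.linalg.norm(text, axis=1, keepdims=True)
    cos = np.clip(v @ t.T, -1.0, 1.0)

    # Subtract the margin from the angle of each positive (diagonal) pair.
    idx = np.arange(len(v))
    theta_pos = np.arccos(cos[idx, idx])
    logits = cos.copy()
    logits[idx, idx] = np.cos(np.maximum(theta_pos - margin, 0.0))

    # Temperature-scaled, numerically stable log-softmax over each row.
    logits = logits / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Average negative log-likelihood of the matched pairs.
    return float(-log_prob[idx, idx].mean())
```

Under this reading, setting `margin=0.0` recovers a plain one-directional InfoNCE loss, and any positive margin yields a loss no larger than that baseline on the same batch.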
