ReVisionLLM: Recursive Vision-Language Model for Temporal Grounding in Hour-Long Videos
Tanveer Hannan, Md Mohaiminul Islam, Jindong Gu, Thomas Seidl, Gedas Bertasius

TL;DR
ReVisionLLM is a recursive vision-language model that effectively performs temporal grounding in hour-long videos by progressively refining its focus, outperforming previous methods and handling videos of varying lengths.
Contribution
It introduces the first VLM capable of temporal grounding in hour-long videos using a recursive, hierarchical approach inspired by human search strategies.
Findings
Outperforms previous state-of-the-art methods by +2.6% R1@0.1 on the MAD dataset.
Handles videos from minutes to hours seamlessly.
Employs a hierarchical training strategy starting from short clips to long videos.
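For readers unfamiliar with the recall figure cited for MAD: temporal grounding benchmarks commonly report Recall@1 at a temporal IoU threshold, i.e. the share of queries whose top-ranked predicted interval overlaps the ground-truth interval by at least that IoU. The sketch below assumes this common definition; the function names are illustrative and are not taken from the paper or its code.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) intervals, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_preds, gts, iou_threshold=0.1):
    """Fraction of queries whose top-ranked prediction reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= iou_threshold for p, g in zip(top1_preds, gts))
    return hits / len(gts)
```

A threshold as loose as 0.1 is typical for hour-long videos, where even a rough overlap with a short event is hard to achieve.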
Abstract
Large language models (LLMs) excel at retrieving information from lengthy text, but their vision-language counterparts (VLMs) face difficulties with hour-long videos, especially for temporal grounding. Specifically, these VLMs are constrained by frame limitations, often losing essential temporal details needed for accurate event localization in extended video content. We propose ReVisionLLM, a recursive vision-language model designed to locate events in hour-long videos. Inspired by human search strategies, our model initially targets broad segments of interest, progressively revising its focus to pinpoint exact temporal boundaries. Our model can seamlessly handle videos of vastly different lengths, from minutes to hours. We also introduce a hierarchical training strategy that starts with short clips to capture distinct events and progressively extends to longer videos. To our…
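The coarse-to-fine idea in the abstract can be illustrated with a toy recursive search. This is a simplified sketch, not the authors' implementation: `score` is a hypothetical stand-in for the model's relevance judgment of a segment, and the equal-width splitting, `branches`, and `min_len` are assumptions made only for illustration.

```python
from typing import Callable, Tuple

Segment = Tuple[float, float]


def recursive_ground(score: Callable[[float, float], float],
                     start: float, end: float,
                     branches: int = 4, min_len: float = 10.0) -> Segment:
    """Narrow [start, end] by repeatedly keeping the best-scoring sub-segment."""
    if end - start <= min_len:
        return (start, end)
    step = (end - start) / branches
    # Split the current window into equal sub-segments (an assumption of this sketch).
    candidates = [(start + i * step, start + (i + 1) * step) for i in range(branches)]
    best = max(candidates, key=lambda seg: score(*seg))
    return recursive_ground(score, best[0], best[1], branches, min_len)


# Toy scorer: seconds of overlap with a hypothetical event at 1200-1230 s.
def toy_score(s: float, e: float) -> float:
    return max(0.0, min(e, 1230.0) - max(s, 1200.0))


segment = recursive_ground(toy_score, 0.0, 3600.0)
```

A greedy single-best choice like this can miss events that straddle a split boundary; the abstract describes the model as revising its focus to pinpoint exact boundaries, which this toy version does not attempt.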
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
Methods: Focus
