ReVisionLLM: Recursive Vision-Language Model for Temporal Grounding in Hour-Long Videos

Tanveer Hannan; Md Mohaiminul Islam; Jindong Gu; Thomas Seidl; Gedas Bertasius

arXiv:2411.14901·cs.CV·November 25, 2024

TL;DR

ReVisionLLM is a recursive vision-language model that effectively performs temporal grounding in hour-long videos by progressively refining its focus, outperforming previous methods and handling videos of varying lengths.

Contribution

The paper introduces what it presents as the first VLM capable of temporal grounding in hour-long videos, using a recursive, hierarchical approach inspired by human search strategies.

Findings

01 Outperforms previous state-of-the-art methods by +2.6% R1@0.1 on the MAD dataset.

02 Handles videos ranging from minutes to hours.

03 Uses a hierarchical training strategy that starts with short clips and progressively extends to longer videos.
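
Temporal grounding gains such as the MAD result above are usually reported as recall at rank k under a temporal IoU threshold, so R1@0.1 would mean the top-ranked predicted segment overlaps the ground truth with IoU of at least 0.1. The sketch below shows that reading of the metric. It is an illustration and not the paper's evaluation code: the function names `temporal_iou` and `recall_at_k` and the `(start, end)` span format in seconds are assumptions made here.

```python
from typing import List, Tuple

# Assumed span format for this sketch: (start_sec, end_sec).
Span = Tuple[float, float]


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection over union of two 1-D time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_k(
    preds: List[List[Span]],  # per query: predictions ranked best-first
    gts: List[Span],          # per query: one ground-truth span
    k: int = 1,
    iou_thresh: float = 0.1,
) -> float:
    """Fraction of queries with at least one top-k prediction above the IoU threshold."""
    if not gts:
        return 0.0
    hits = 0
    for ranked, gt in zip(preds, gts):
        if any(temporal_iou(p, gt) >= iou_thresh for p in ranked[:k]):
            hits += 1
    return hits / len(gts)


if __name__ == "__main__":
    predictions = [[(0.0, 10.0)], [(100.0, 110.0)]]
    ground_truth = [(5.0, 15.0), (0.0, 10.0)]
    print(recall_at_k(predictions, ground_truth, k=1, iou_thresh=0.1))
```

In the toy data, the first query's prediction overlaps its ground truth and the second does not, so only one of the two queries counts as a hit.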

Abstract

Large language models (LLMs) excel at retrieving information from lengthy text, but their vision-language counterparts (VLMs) face difficulties with hour-long videos, especially for temporal grounding. Specifically, these VLMs are constrained by frame limitations, often losing essential temporal details needed for accurate event localization in extended video content. We propose ReVisionLLM, a recursive vision-language model designed to locate events in hour-long videos. Inspired by human search strategies, our model initially targets broad segments of interest, progressively revising its focus to pinpoint exact temporal boundaries. Our model can seamlessly handle videos of vastly different lengths, from minutes to hours. We also introduce a hierarchical training strategy that starts with short clips to capture distinct events and progressively extends to longer videos. To our…
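
The coarse-to-fine idea in the abstract (start from broad segments, then progressively narrow to exact boundaries) can be sketched as a recursive search over time intervals. This is a minimal illustration under stated assumptions and not the authors' architecture: `score` is a hypothetical stand-in for whatever relevance model judges a segment against the query, and `parts`, `keep`, and `min_len` are parameters invented for this sketch.

```python
from typing import Callable, List, Tuple

Span = Tuple[float, float]              # (start_sec, end_sec)
Scorer = Callable[[str, Span], float]   # hypothetical relevance model


def split(span: Span, parts: int) -> List[Span]:
    """Divide a span into equal-length child spans."""
    start, end = span
    step = (end - start) / parts
    return [(start + i * step, start + (i + 1) * step) for i in range(parts)]


def recursive_ground(
    query: str,
    span: Span,
    score: Scorer,
    parts: int = 4,
    keep: int = 1,
    min_len: float = 30.0,
) -> List[Span]:
    """Keep the best-scoring children at each level and recurse into them."""
    if span[1] - span[0] <= min_len:
        return [span]  # fine enough: stop refining
    children = split(span, parts)
    ranked = sorted(children, key=lambda s: score(query, s), reverse=True)
    results: List[Span] = []
    for child in ranked[:keep]:
        results.extend(recursive_ground(query, child, score, parts, keep, min_len))
    return results


# Toy scorer for demonstration only: overlap (in seconds) with a known target.
TARGET: Span = (1000.0, 1020.0)


def toy_score(query: str, span: Span) -> float:
    return max(0.0, min(span[1], TARGET[1]) - max(span[0], TARGET[0]))


if __name__ == "__main__":
    # One-hour video, searched recursively down to segments of at most 30 s.
    print(recursive_ground("toy query", (0.0, 3600.0), toy_score))
```

Each level only scores `parts` segments, so the number of scorer calls grows with the depth of the recursion and not with the raw length of the video. A greedy search like this can miss an event if a coarse level ranks the wrong segment first, which is one reason to keep more than one candidate per level (`keep > 1`).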

Code & Models

Repositories

tanveer81/revisionllm (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization

Methods: Focus