Number it: Temporal Grounding Videos like Flipping Manga
Yongliang Wu, Xinting Hu, Yuyang Sun, Yizhou Zhou, Wenbo Zhu, Fengyun Rao, Bernt Schiele, Xu Yang

TL;DR
NumPro introduces a numerical prompting technique that enables Video Large Language Models to perform precise temporal grounding by treating videos as sequences of numbered frames, significantly improving accuracy without extra computational cost.
Contribution
The paper presents NumPro, a novel numerical prompting method that enhances Vid-LLMs' ability to perform temporal localization in videos, achieving state-of-the-art results.
Findings
NumPro improves moment retrieval, the core VTG task, by up to 6.9% mIoU.
NumPro enhances highlight detection by 8.5% mAP.
The method requires no additional computational cost.
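The mIoU figure in the findings is conventionally the mean temporal Intersection over Union between predicted and ground-truth segments. A minimal sketch of that metric follows, assuming each segment is a (start, end) pair in seconds. The function names are illustrative and not taken from the paper, whose exact evaluation protocol may differ.

```python
def temporal_iou(pred, gt):
    """Temporal IoU of two segments given as (start, end) in seconds."""
    (ps, pe), (gs, ge) = pred, gt
    # Overlap length; zero when the segments do not intersect.
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    # Union length = sum of both lengths minus the overlap.
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds, gts):
    """Average temporal IoU over paired predictions and ground truths."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)
```

For example, a prediction of (0, 10) against a ground truth of (5, 15) overlaps for 5 seconds out of a 15-second union, giving an IoU of one third.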
Abstract
Video Large Language Models (Vid-LLMs) have made remarkable advancements in comprehending video content for QA dialogue. However, they struggle to extend this visual understanding to tasks requiring precise temporal localization, known as Video Temporal Grounding (VTG). To address this gap, we introduce Number-Prompt (NumPro), a novel method that empowers Vid-LLMs to bridge visual comprehension with temporal grounding by adding unique numerical identifiers to each video frame. Treating a video as a sequence of numbered frame images, NumPro transforms VTG into an intuitive process: flipping through manga panels in sequence. This allows Vid-LLMs to "read" event timelines, accurately linking visual content with corresponding temporal information. Our experiments demonstrate that NumPro significantly boosts VTG performance of top-tier Vid-LLMs without additional computational cost.…
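The bookkeeping behind numbered frames can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it skips drawing the number onto each frame image and only shows how a frame number maps to a timestamp. The zero-based numbering, the uniform sampling rate, the "from frame A to frame B" answer format, and all function names are hypothetical.

```python
import re


def number_frames(num_frames, sample_fps):
    """Pair each sampled frame's number with its timestamp in seconds.

    Assumes uniform sampling at sample_fps and zero-based numbering.
    """
    return [(i, i / sample_fps) for i in range(num_frames)]


def parse_frame_span(answer):
    """Extract a (start, end) frame span from a model answer.

    Assumes a hypothetical answer such as 'from frame 12 to frame 20'.
    Returns None when no span is found.
    """
    m = re.search(r"frame\s+(\d+)\s+to\s+(?:frame\s+)?(\d+)", answer, re.IGNORECASE)
    if m is None:
        return None
    start, end = int(m.group(1)), int(m.group(2))
    # Normalize so that start never exceeds end.
    return (start, end) if start <= end else (end, start)


def span_to_seconds(span, sample_fps):
    """Convert a frame-number span back to seconds."""
    start, end = span
    return (start / sample_fps, end / sample_fps)
```

With frames sampled at 0.5 FPS, an answer of "from frame 12 to frame 20" would map to the interval from 24 to 40 seconds. Because the model only has to read the numbers it sees in the frames, the conversion to time stays a fixed arithmetic step outside the model.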
Taxonomy
Topics: Digital Games and Media · Multimedia Communication and Technology · Video Analysis and Summarization
