Scanning Only Once: An End-to-end Framework for Fast Temporal Grounding in Long Videos

Yulin Pan; Xiangteng He; Biao Gong; Yiliang Lv; Yujun Shen; Yuxin Peng; Deli Zhao

arXiv:2303.08345·cs.CV·February 20, 2024·1 cites

TL;DR

This paper introduces an end-to-end framework for fast temporal grounding in long videos, enabling one-time processing to efficiently and accurately locate query-related segments across hours of footage.

Contribution

The proposed method models entire long videos in a single pass, combining coarse and fine content analysis to improve speed and accuracy over existing sliding window approaches.
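To make the contrast with sliding windows concrete, the sketch below counts how many chunks each scheme has to process for a video of a given length. It is only illustrative arithmetic: the function names, window size, stride, and anchor length are assumptions chosen for the example and are not taken from the paper.

```python
import math


def sliding_window_runs(n_frames: int, window: int, stride: int) -> int:
    """Number of overlapping windows needed to cover n_frames frames."""
    if n_frames <= window:
        return 1
    # One window at the start, plus one per stride step until the end is covered.
    return math.ceil((n_frames - window) / stride) + 1


def non_overlapping_anchors(n_frames: int, anchor_len: int) -> int:
    """Number of non-overlapping clips (anchors) that cover n_frames frames."""
    return math.ceil(n_frames / anchor_len)


if __name__ == "__main__":
    # Hypothetical numbers: 1000 frames, window/anchor of 100 frames, stride 50.
    print(sliding_window_runs(1000, window=100, stride=50))
    print(non_overlapping_anchors(1000, anchor_len=100))
```

With overlapping windows, every frame may be encoded more than once, and each window is typically a separate network execution. Non-overlapping anchors touch each frame once, which is what makes a single pass over the whole video possible.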

Findings

01

Outperforms state-of-the-art on MAD and Ego4d datasets

02

Achieves 14.6x and 102.8x higher efficiency on MAD and Ego4d, respectively

03

Effectively captures long-range temporal correlations

Abstract

Video temporal grounding aims to pinpoint a video segment that matches the query description. Despite recent advances on short-form videos (e.g., minutes long), temporal grounding in long videos (e.g., hours long) is still at an early stage. To address this challenge, a common practice is to employ a sliding window, yet this can be inefficient and inflexible due to the limited number of frames within the window. In this work, we propose an end-to-end framework for fast temporal grounding, which is able to model an hours-long video with one-time network execution. Our pipeline is formulated in a coarse-to-fine manner, where we first extract context knowledge from non-overlapped video clips (i.e., anchors), and then supplement the anchors that respond strongly to the query with detailed content knowledge. Besides the remarkably high pipeline efficiency,…
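The coarse-to-fine idea in the abstract can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' network: frame and query features are assumed to be precomputed vectors, anchors are scored by a dot product on mean-pooled features, and the names `coarse_to_fine_grounding`, `anchor_len`, and `top_k` are hypothetical.

```python
import numpy as np


def coarse_to_fine_grounding(frame_feats, query_feat, anchor_len=4, top_k=2):
    """Rank non-overlapping anchors coarsely, then refine inside the best ones.

    Returns a list of (anchor_index, best_frame_index, frame_score) tuples,
    ordered by descending coarse anchor score.
    """
    n_frames, dim = frame_feats.shape
    n_anchors = n_frames // anchor_len  # assumption: drop a trailing partial anchor

    # Coarse stage: split the video into non-overlapping anchors and pool each one.
    anchors = frame_feats[: n_anchors * anchor_len].reshape(n_anchors, anchor_len, dim)
    anchor_feats = anchors.mean(axis=1)
    anchor_scores = anchor_feats @ query_feat

    # Keep only the anchors that respond most strongly to the query.
    top = np.argsort(-anchor_scores, kind="stable")[:top_k]

    # Fine stage: look at frame-level content only inside the selected anchors.
    results = []
    for a in top:
        frame_scores = anchors[a] @ query_feat
        best = int(np.argmax(frame_scores))
        results.append((int(a), int(a) * anchor_len + best, float(frame_scores[best])))
    return results


if __name__ == "__main__":
    # Hypothetical data: 16 frames with 3-dim features; only frame 9 matches the query.
    feats = np.zeros((16, 3))
    feats[9] = [5.0, 0.0, 0.0]
    query = np.array([1.0, 0.0, 0.0])
    print(coarse_to_fine_grounding(feats, query, anchor_len=4, top_k=1))
```

The point of the design is that the expensive frame-level comparison runs only on the few anchors the coarse stage selects, rather than on every frame of an hours-long video.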

Code & Models

Repositories

afcedf/soonet · tf · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization