Seq2Time: Sequential Knowledge Transfer for Video LLM Temporal Grounding

Andong Deng; Zhongpai Gao; Anwesa Choudhuri; Benjamin Planche; Meng Zheng; Bin Wang; Terrence Chen; Chen Chen; Ziyan Wu

arXiv:2411.16932·cs.CV·November 27, 2024

Open Access

TL;DR

Seq2Time introduces a data-driven training approach that enhances temporal awareness in video LLMs by converting image and clip datasets into sequences with temporal annotations, significantly improving performance on video grounding benchmarks.

Contribution

The paper presents a novel sequence-to-time transfer method and a unified time representation, enabling self-supervised training for better temporal understanding in long videos.

Findings

01. 27.6% improvement in F1 score on YouCook2
02. 44.8% increase in CIDEr score on YouCook2
03. 14.7% higher recall on Charades-STA

Abstract

Temporal awareness is essential for video large language models (LLMs) to understand and reason about events within long videos, enabling applications like dense video captioning and temporal video grounding in a unified system. However, the scarcity of long videos with detailed captions and precise temporal annotations limits their temporal awareness. In this paper, we propose Seq2Time, a data-oriented training paradigm that leverages sequences of images and short video clips to enhance temporal awareness in long videos. By converting sequence positions into temporal annotations, we transform large-scale image and clip captioning datasets into sequences that mimic the temporal structure of long videos, enabling self-supervised training with abundant time-sensitive data. To enable sequence-to-time knowledge transfer, we introduce a novel time representation that unifies positional…
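The core idea of converting sequence positions into temporal annotations can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract is truncated before the time representation is described, so the function names, the `num_bins` discretization, and the `<start> to <end>` output format below are all hypothetical choices made for illustration only.

```python
from typing import List, Tuple

# Hypothetical annotation type: (start_bin, end_bin, caption).
Annotation = Tuple[int, int, str]


def sequence_to_time_annotations(
    captions: List[str], num_bins: int = 100
) -> List[Annotation]:
    """Map each item's position in a sequence to a relative time span.

    Item i of n is treated as occupying the span [i/n, (i+1)/n) of a
    pseudo-video, discretized into `num_bins` relative time bins.
    Integer arithmetic avoids floating-point rounding surprises.
    `num_bins` is an assumed parameter, not taken from the paper.
    """
    n = len(captions)
    annotations: List[Annotation] = []
    for i, caption in enumerate(captions):
        start_bin = i * num_bins // n
        end_bin = (i + 1) * num_bins // n
        annotations.append((start_bin, end_bin, caption))
    return annotations


def format_target(annotations: List[Annotation]) -> str:
    """Render annotations as a text target (hypothetical format)."""
    return "\n".join(
        f"<{start}> to <{end}>: {caption}" for start, end, caption in annotations
    )


if __name__ == "__main__":
    # Captions from an image-captioning dataset, arranged as a sequence.
    image_captions = [
        "a person cracks an egg",
        "a pan on a stove",
        "a plate of scrambled eggs",
        "a fork next to a plate",
    ]
    print(format_target(sequence_to_time_annotations(image_captions)))
```

Under these assumptions, any captioned image or clip dataset yields time-annotated training targets without manual labeling, which is the sense in which the abstract describes the training as self-supervised.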

Taxonomy

Topics: Video Analysis and Summarization · Anomaly Detection Techniques and Applications · Human Pose and Action Recognition

Methods: Contrastive Language-Image Pre-training