Seq2Time: Sequential Knowledge Transfer for Video LLM Temporal Grounding
Andong Deng, Zhongpai Gao, Anwesa Choudhuri, Benjamin Planche, Meng Zheng, Bin Wang, Terrence Chen, Chen Chen, Ziyan Wu

TL;DR
Seq2Time introduces a data-driven training approach that enhances temporal awareness in video LLMs by converting image and clip datasets into sequences with temporal annotations, significantly improving performance on video grounding benchmarks.
Contribution
The paper presents a novel sequence-to-time transfer method and a unified time representation, enabling self-supervised training for better temporal understanding in long videos.
Findings
27.6% improvement in F1 score on YouCook2
44.8% increase in CIDEr score on YouCook2
14.7% higher recall on Charades-STA
Abstract
Temporal awareness is essential for video large language models (LLMs) to understand and reason about events within long videos, enabling applications like dense video captioning and temporal video grounding in a unified system. However, the scarcity of long videos with detailed captions and precise temporal annotations limits their temporal awareness. In this paper, we propose Seq2Time, a data-oriented training paradigm that leverages sequences of images and short video clips to enhance temporal awareness in long videos. By converting sequence positions into temporal annotations, we transform large-scale image and clip captioning datasets into sequences that mimic the temporal structure of long videos, enabling self-supervised training with abundant time-sensitive data. To enable sequence-to-time knowledge transfer, we introduce a novel time representation that unifies positional…
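The abstract's idea of converting sequence positions into temporal annotations can be illustrated with a minimal sketch. This is not the authors' implementation: the `TimedCaption` type, the `positions_to_time` helper, and the choice of equal-width normalized spans in [0, 1] are all assumptions made for illustration, and the paper's actual time representation is not specified in the text above.

```python
from dataclasses import dataclass


@dataclass
class TimedCaption:
    """Hypothetical record: a caption with a normalized time span."""
    start: float  # normalized start in [0, 1]
    end: float    # normalized end in [0, 1]
    text: str


def positions_to_time(captions: list[str]) -> list[TimedCaption]:
    """Assign the i-th of n captions the normalized span [i/n, (i+1)/n].

    Assumption: every item in the sequence occupies an equal share of the
    pseudo-video's duration. An empty input yields an empty list.
    """
    n = len(captions)
    return [
        TimedCaption(i / n, (i + 1) / n, text)
        for i, text in enumerate(captions)
    ]


# Four unrelated image captions treated as one pseudo-video.
sequence = [
    "a dog on a beach",
    "a red bicycle",
    "a bowl of soup",
    "a city at night",
]
timed = positions_to_time(sequence)
for item in timed:
    print(f"{item.start:.2f}-{item.end:.2f}: {item.text}")
```

Under these assumptions, each caption's position in the sequence doubles as its timestamp, so an ordinary captioning dataset yields time-annotated training examples without manual labeling.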
Taxonomy
Topics: Video Analysis and Summarization · Anomaly Detection Techniques and Applications · Human Pose and Action Recognition
Methods: Contrastive Language-Image Pre-training
