TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding

Shuhuai Ren; Linli Yao; Shicheng Li; Xu Sun; Lu Hou

arXiv:2312.02051 · cs.CV · March 29, 2024 · 6 cites

TL;DR

TimeChat is a novel time-sensitive multimodal large language model designed for long video understanding, integrating timestamp-aware encoding and a sliding video Q-Former to improve temporal reasoning and localization.

Contribution

The paper introduces TimeChat, which combines a timestamp-aware frame encoder with a sliding video Q-Former, and an instruction-tuning dataset covering 6 tasks and 125K instances, to improve long video understanding.

Findings

1. Achieves significant improvements in dense captioning, temporal grounding, and highlight detection, for example +9.2 F1 score and +2.8 CIDEr on YouCook2.

2. Demonstrates strong zero-shot temporal localization and reasoning abilities.

3. Outperforms state-of-the-art models on multiple long video understanding benchmarks.

Abstract

This work proposes TimeChat, a time-sensitive multimodal large language model specifically designed for long video understanding. Our model incorporates two key architectural contributions: (1) a timestamp-aware frame encoder that binds visual content with the timestamp of each frame, and (2) a sliding video Q-Former that produces a video token sequence of varying lengths to accommodate videos of various durations. Additionally, we construct an instruction-tuning dataset, encompassing 6 tasks and a total of 125K instances, to further enhance TimeChat's instruction-following performance. Experiment results across various video understanding tasks, such as dense captioning, temporal grounding, and highlight detection, demonstrate TimeChat's strong zero-shot temporal localization and reasoning capabilities. For example, it achieves +9.2 F1 score and +2.8 CIDEr on YouCook2, +5.8 HIT@1 on…
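
The two architectural ideas in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the prompt wording, and the window, stride, and tokens-per-window parameters are all assumptions chosen for illustration. It only shows how pairing each frame with a timestamp string and sliding a fixed-size window over the frames yields a token sequence whose length grows with video duration.

```python
from typing import List, Tuple


def timestamp_prompt(t_seconds: float) -> str:
    """Hypothetical text that binds a frame to its sampling time."""
    return f"This frame is sampled at {t_seconds:.1f}s."


def sliding_windows(num_frames: int, window: int, stride: int) -> List[Tuple[int, int]]:
    """Return [start, end) frame-index ranges that cover the whole video.

    The last window is clipped to the end of the video.
    """
    if num_frames <= 0 or window <= 0 or stride <= 0:
        raise ValueError("num_frames, window, and stride must be positive")
    spans = []
    start = 0
    while True:
        end = min(start + window, num_frames)
        spans.append((start, end))
        if end == num_frames:
            break
        start += stride
    return spans


def video_token_count(num_frames: int, window: int, stride: int,
                      tokens_per_window: int) -> int:
    """Each window is assumed to be compressed into a fixed number of tokens,
    so the total token count scales with the number of windows."""
    return len(sliding_windows(num_frames, window, stride)) * tokens_per_window
```

Under these assumed settings, doubling the number of sampled frames doubles the number of windows and therefore the length of the video token sequence, which is the behaviour the abstract describes as accommodating videos of various durations.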

Code & Models

Datasets

ShuhuaiRen/TimeIT
dataset · 339 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning