Efficient Temporal Extrapolation of Multimodal Large Language Models with Temporal Grounding Bridge
Yuxuan Wang, Yueqian Wang, Pengfei Wu, Jianxin Liang, Dongyan Zhao, Yang Liu, Zilong Zheng

TL;DR
This paper introduces the Temporal Grounding Bridge (TGB), a framework that significantly improves the temporal understanding and extrapolation capabilities of multimodal large language models for long-form video analysis.
Contribution
The paper presents a novel TGB framework with an efficient multi-span temporal grounding algorithm, a length extrapolation training paradigm, and a bootstrapping approach to enhance MLLMs without additional annotations.
Findings
TGB improves temporal grounding in MLLMs across seven benchmarks.
Trained on four-frame sequences, the model handles sequences up to 16 times longer without sacrificing performance.
Significant performance gains over prior MLLMs are demonstrated.
Abstract
Despite progress in multimodal large language models (MLLMs), the challenge of interpreting long-form videos in response to linguistic queries persists, largely due to the inefficiency in temporal grounding and limited pre-trained context window size. In this work, we introduce Temporal Grounding Bridge (TGB), a novel framework that bootstraps MLLMs with advanced temporal grounding capabilities and broadens their contextual scope. Our framework significantly enhances the temporal capabilities of current MLLMs through three key innovations: an efficient multi-span temporal grounding algorithm applied to low-dimension temporal features projected from flow; a multimodal length extrapolation training paradigm that utilizes low-dimension temporal features to extend the training context window size; and a bootstrapping framework that bridges our model with pluggable MLLMs without requiring…
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition
Methods: Semi-Pseudo-Label
