Efficient Temporal Extrapolation of Multimodal Large Language Models with Temporal Grounding Bridge

Yuxuan Wang; Yueqian Wang; Pengfei Wu; Jianxin Liang; Dongyan Zhao; Yang Liu; Zilong Zheng

arXiv:2402.16050·cs.CV·October 4, 2024·1 cites

Open Access · 2 Repos

TL;DR

This paper introduces the Temporal Grounding Bridge (TGB), a framework that significantly improves the temporal understanding and extrapolation capabilities of multimodal large language models for long-form video analysis.

Contribution

The paper presents a novel TGB framework with an efficient multi-span temporal grounding algorithm, a length extrapolation training paradigm, and a bootstrapping approach to enhance MLLMs without additional annotations.
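As a rough illustration of what "multi-span" grounding means, the sketch below groups frames with high query-relevance scores into several disjoint spans and then samples a fixed frame budget from them. This is not the paper's algorithm, which is not described on this page. The function names, the threshold rule, and the even-sampling step are all assumptions made for illustration.

```python
from typing import List, Tuple


def multi_span_select(scores: List[float], threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Group consecutive frames with score >= threshold into inclusive (start, end) spans.

    Hypothetical helper: a simple thresholding rule, not the TGB method.
    """
    spans: List[Tuple[int, int]] = []
    start = None
    for i, s in enumerate(scores):
        if s >= threshold and start is None:
            start = i  # a new span opens here
        elif s < threshold and start is not None:
            spans.append((start, i - 1))  # the span closed on the previous frame
            start = None
    if start is not None:
        spans.append((start, len(scores) - 1))  # span runs to the last frame
    return spans


def sample_frames(spans: List[Tuple[int, int]], budget: int) -> List[int]:
    """Pick at most `budget` frame indices, spread evenly over the frames inside the spans."""
    if budget <= 0:
        return []
    covered = [i for a, b in spans for i in range(a, b + 1)]
    if len(covered) <= budget:
        return covered
    step = len(covered) / budget
    return [covered[int(k * step)] for k in range(budget)]


# Made-up per-frame relevance scores for a 9-frame clip.
scores = [0.1, 0.7, 0.8, 0.2, 0.1, 0.9, 0.6, 0.95, 0.3]
spans = multi_span_select(scores, threshold=0.5)
frames = sample_frames(spans, budget=4)
print(spans, frames)
```

With these made-up scores the relevant frames fall into two separate spans, so the frame budget is spent only on those regions instead of being spread uniformly over the whole clip.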

Findings

01

TGB improves temporal grounding in MLLMs across seven benchmarks.

02

Trained on 4-frame sequences, the model handles sequences up to 16 times longer without performance loss.

03

Significant performance gains over prior MLLMs are demonstrated.

Abstract

Despite progress in multimodal large language models (MLLMs), the challenge of interpreting long-form videos in response to linguistic queries persists, largely due to the inefficiency in temporal grounding and limited pre-trained context window size. In this work, we introduce Temporal Grounding Bridge (TGB), a novel framework that bootstraps MLLMs with advanced temporal grounding capabilities and broadens their contextual scope. Our framework significantly enhances the temporal capabilities of current MLLMs through three key innovations: an efficient multi-span temporal grounding algorithm applied to low-dimension temporal features projected from flow; a multimodal length extrapolation training paradigm that utilizes low-dimension temporal features to extend the training context window size; and a bootstrapping framework that bridges our model with pluggable MLLMs without requiring…

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition

Methods: Semi-Pseudo-Label