Exploring the Role of Explicit Temporal Modeling in Multimodal Large Language Models for Video Understanding

Yun Li; Zhe Liu; Yajing Kong; Guangrui Li; Jiyuan Zhang; Chao Bian; Feng Liu; Lina Yao; Zhenbang Sun

arXiv:2501.16786·cs.CV·January 29, 2025

Open Access

TL;DR

This paper investigates the importance of explicit temporal modeling in multimodal large language models for video understanding. It proposes a flexible encoder for comparing explicit modeling with implicit methods and demonstrates the effectiveness of the explicit approach.

Contribution

Introduction of the Stackable Temporal Encoder (STE), a flexible module for explicit temporal modeling, enabling systematic comparison with implicit methods in video MLLMs.

Findings

01. Explicit temporal modeling significantly improves video understanding performance.

02. STE effectively balances token compression and temporal receptive fields.

03. Explicit modeling enhances temporal-specific understanding in videos.

Abstract

Applying Multimodal Large Language Models (MLLMs) to video understanding presents significant challenges due to the need to model temporal relations across frames. Existing approaches adopt either implicit temporal modeling, relying solely on the LLM decoder, or explicit temporal modeling, employing auxiliary temporal encoders. To investigate this debate between the two paradigms, we propose the Stackable Temporal Encoder (STE). STE enables flexible explicit temporal modeling with adjustable temporal receptive fields and token compression ratios. Using STE, we systematically compare implicit and explicit temporal modeling across dimensions such as overall performance, token compression effectiveness, and temporal-specific understanding. We also explore STE's design considerations and broader impacts as a plug-in module and in image modalities. Our findings emphasize the critical role of…
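
The abstract mentions adjustable temporal receptive fields and token compression ratios but does not describe how STE implements them. As a rough illustration of how those two quantities relate when windowed layers are stacked along the time axis, here is a minimal sketch using generic 1-D average pooling. It is not the authors' STE: the function names, the (kernel, stride) parameterization, and the use of average pooling are all assumptions made for this example.

```python
from typing import List, Tuple


def stacked_temporal_stats(layers: List[Tuple[int, int]]) -> Tuple[int, int]:
    """Return (receptive_field, compression) for stacked (kernel, stride) layers.

    Hypothetical helper: uses the standard receptive-field recurrence for
    stacked 1-D windowed layers, not anything taken from the paper.
    """
    receptive_field = 1  # input frames seen by one output token
    jump = 1             # input-frame distance between adjacent outputs
    for kernel, stride in layers:
        receptive_field += (kernel - 1) * jump
        jump *= stride
    # With no padding, the frame count shrinks by roughly a factor of `jump`.
    return receptive_field, jump


def temporal_avg_pool(
    frames: List[List[float]], kernel: int, stride: int
) -> List[List[float]]:
    """Average each window of `kernel` consecutive frame vectors.

    Windows start every `stride` frames; incomplete trailing windows are dropped.
    """
    pooled = []
    for start in range(0, len(frames) - kernel + 1, stride):
        window = frames[start:start + kernel]
        pooled.append([sum(column) / kernel for column in zip(*window)])
    return pooled


if __name__ == "__main__":
    # Two layers with kernel 2, stride 2: each output covers 4 frames,
    # and the sequence is 4x shorter.
    print(stacked_temporal_stats([(2, 2), (2, 2)]))

    # Two layers with kernel 3, stride 1: wider receptive field, no compression.
    print(stacked_temporal_stats([(3, 1), (3, 1)]))

    # Toy one-dimensional "frame features".
    toy_frames = [[0.0], [2.0], [4.0], [6.0]]
    print(temporal_avg_pool(toy_frames, kernel=2, stride=2))
```

In this toy setup, stacking more layers or widening the kernel grows the receptive field, while the strides alone set the compression, so the two can be tuned separately. Whether STE exposes the same knobs in the same way is not stated in the text above.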

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: ADaptive gradient method with the OPTimal convergence rate