Temporal Grounding of Activities using Multimodal Large Language Models

Young Chol Song

arXiv:2407.06157·cs.CV·July 9, 2024

TL;DR

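This paper investigates how multimodal large language models can improve the temporal localization of activities in videos. It reports that a two-stage combination of image-based and text-based LLMs outperforms existing video-based LLMs, and that instruction-tuning a smaller multimodal LLM further improves its ability to identify activity intervals.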

Contribution

It introduces a two-stage approach using multimodal LLMs for activity localization and shows that instruction-tuning smaller models improves their temporal reasoning abilities.

Findings

01. Outperforms existing video-based LLMs in activity localization

02. Instruction-tuning enhances model performance in identifying activity intervals

03. Effective on the Charades-STA dataset

Abstract

Temporal grounding of activities, the identification of specific time intervals of actions within a larger event context, is a critical task in video understanding. Recent advancements in multimodal large language models (LLMs) offer new opportunities for enhancing temporal reasoning capabilities. In this paper, we evaluate the effectiveness of combining image-based and text-based LLMs in a two-stage approach for temporal activity localization. We demonstrate that our method outperforms existing video-based LLMs. Furthermore, we explore the impact of instruction-tuning on a smaller multimodal LLM, showing that refining its ability to process action queries leads to more expressive and informative outputs, thereby enhancing its performance in identifying specific time intervals of activities. Our experimental results on the Charades-STA dataset highlight the…
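The abstract is cut off before it names an evaluation metric, so the following is only a minimal sketch of how predicted activity intervals are commonly scored in temporal grounding: temporal intersection-over-union (IoU) between a predicted and a ground-truth interval, and the fraction of queries whose prediction clears an IoU threshold. The function names, the (start, end) tuple format in seconds, and the 0.5 default threshold are assumptions made for illustration, not details taken from the paper.

```python
from typing import Sequence, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds; assumed format


def temporal_iou(pred: Interval, gt: Interval) -> float:
    """Intersection-over-union of two time intervals."""
    p_start, p_end = pred
    g_start, g_end = gt
    # Overlap length, clamped at zero for disjoint intervals.
    intersection = max(0.0, min(p_end, g_end) - max(p_start, g_start))
    union = (p_end - p_start) + (g_end - g_start) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_iou(
    preds: Sequence[Interval],
    gts: Sequence[Interval],
    threshold: float = 0.5,  # hypothetical default, not from the paper
) -> float:
    """Fraction of queries whose single prediction reaches the IoU threshold."""
    if not gts:
        return 0.0
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)
```

For example, a prediction of (0, 10) against a ground truth of (5, 15) overlaps for 5 seconds out of a 15-second union, giving an IoU of one third, which would count as a miss at a 0.5 threshold.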

Taxonomy

Topics: Speech and dialogue systems