Temporal Grounding of Activities using Multimodal Large Language Models
Young Chol Song

TL;DR
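This paper investigates how multimodal large language models can improve the temporal localization of activities in videos. It shows that a two-stage combination of image-based and text-based LLMs outperforms existing video-based LLMs, and that instruction-tuning a smaller multimodal LLM further improves its ability to identify activity intervals.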
Contribution
It introduces a two-stage approach using multimodal LLMs for activity localization and shows that instruction-tuning smaller models improves their temporal reasoning abilities.
Findings
Outperforms existing video-based LLMs in activity localization
Instruction-tuning enhances model performance in identifying activity intervals
Effective on the Charades-STA dataset
Abstract
Temporal grounding of activities, the identification of specific time intervals of actions within a larger event context, is a critical task in video understanding. Recent advancements in multimodal large language models (LLMs) offer new opportunities for enhancing temporal reasoning capabilities. In this paper, we evaluate the effectiveness of combining image-based and text-based LLMs in a two-stage approach for temporal activity localization. We demonstrate that our method outperforms existing video-based LLMs. Furthermore, we explore the impact of instruction-tuning on a smaller multimodal LLM, showing that refining its ability to process action queries leads to more expressive and informative outputs, thereby enhancing its performance in identifying specific time intervals of activities. Our experimental results on the Charades-STA dataset highlight the…
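To make "identifying specific time intervals" concrete, the sketch below scores a predicted interval against a ground-truth interval using temporal intersection-over-union (tIoU), with recall at a tIoU threshold on top. This excerpt does not state which metric the paper reports, so treat the metric choice as an assumption based on common practice for temporal grounding benchmarks. The function names `temporal_iou` and `recall_at_iou` and the sample intervals are hypothetical.

```python
from typing import Sequence, Tuple

Interval = Tuple[float, float]  # (start_seconds, end_seconds)


def temporal_iou(pred: Interval, gt: Interval) -> float:
    """Intersection-over-union of two time intervals (assumed metric, not confirmed by the paper)."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    # Overlap is zero when the intervals do not intersect.
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_iou(
    preds: Sequence[Interval], gts: Sequence[Interval], threshold: float = 0.5
) -> float:
    """Fraction of queries whose single predicted interval reaches the tIoU threshold."""
    if not gts:
        return 0.0
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)


if __name__ == "__main__":
    # Made-up intervals for illustration only; not data from the paper.
    predictions = [(0.0, 10.0), (2.0, 8.0)]
    ground_truth = [(0.0, 10.0), (20.0, 30.0)]
    print(temporal_iou((0.0, 10.0), (5.0, 15.0)))
    print(recall_at_iou(predictions, ground_truth, threshold=0.5))
```

A stricter threshold (for example 0.7) rewards tighter boundaries, which is why results on this kind of task are often reported at more than one threshold.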
Taxonomy
Topics: Speech and dialogue systems
