LLM4VG: Large Language Models Evaluation for Video Grounding

Wei Feng; Xin Wang; Hong Chen; Zeyang Zhang; Houlun Chen; Zihan Song; Yuwei Zhou; Yuekui Yang; Haiyang Wu; Wenwu Zhu

arXiv:2312.14206 · cs.CV · September 13, 2024 · 1 citation

TL;DR

This paper introduces LLM4VG, a benchmark that evaluates how well large language models perform video grounding. It finds that current video LLMs handle the task poorly, while LLMs combined with pretrained visual description models show more promise.

Contribution

The paper proposes a systematic benchmark for video grounding with LLMs, designs prompt methods for integrating visual descriptions, and provides a comprehensive experimental analysis of model performance.
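As a rough illustration of what integrating visual descriptions into a prompt might look like, the Python sketch below assembles timestamped captions and a query into a single text prompt. The function name `build_grounding_prompt`, the prompt wording, and the caption format are all hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def build_grounding_prompt(captions: List[Tuple[float, str]], query: str) -> str:
    """Assemble a text prompt from (timestamp_seconds, caption) pairs.

    Hypothetical helper: the wording is illustrative, not the paper's prompt.
    """
    # Order captions chronologically so the LLM reads the video in sequence.
    lines = [f"[{t:.1f}s] {text}" for t, text in sorted(captions)]
    description = "\n".join(lines)
    return (
        "Below are descriptions of a video sampled over time.\n"
        f"{description}\n"
        f"Query: {query}\n"
        "Answer with the start and end time in seconds, as 'start - end'."
    )


# Made-up captions, as if produced by a captioning model on sampled frames.
example = build_grounding_prompt(
    [(2.0, "a person pours water"), (0.0, "a person opens a fridge")],
    "the person pours water into a glass",
)
print(example)
```

The captions would come from whichever visual description model is paired with the LLM; different prompt wordings can then be compared on the same inputs.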

Findings

1. Existing VidLLMs perform poorly on video grounding tasks.

2. Combining LLMs with visual models shows promise.

3. Prompt design significantly influences model performance.

Abstract

Recently, researchers have investigated the capability of LLMs to handle videos and proposed several video LLMs. However, the ability of LLMs to handle video grounding (VG), an important time-related video task that requires the model to precisely locate the start and end timestamps of temporal moments in videos that match given textual queries, remains unclear and unexplored in the literature. To fill this gap, we propose the LLM4VG benchmark, which systematically evaluates the performance of different LLMs on video grounding tasks. Based on LLM4VG, we design extensive experiments to examine two groups of video LLMs on video grounding: (i) video LLMs trained on text-video pairs (denoted as VidLLM), and (ii) LLMs combined with pretrained visual description models, such as video/image captioning models. We…
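The task described above, predicting a start and end timestamp for a query, is commonly scored in the video grounding literature by temporal intersection-over-union (IoU) between the predicted and ground-truth spans. This excerpt does not state which metric LLM4VG uses, so the following is only a generic sketch; `temporal_iou` and the example values are illustrative.

```python
from typing import Tuple

Span = Tuple[float, float]  # (start_seconds, end_seconds)


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection-over-union of two time spans on the same timeline."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    # Overlap is zero when the spans do not intersect.
    inter = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - inter
    return inter / union if union > 0 else 0.0


# A prediction of 2-6 s against a ground truth of 4-8 s overlaps for 2 s
# out of a combined 6 s.
score = temporal_iou((2.0, 6.0), (4.0, 8.0))
print(round(score, 3))
```

A prediction is then typically counted as correct when its IoU exceeds a chosen threshold, with the threshold value being an evaluation choice.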

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Video Analysis and Summarization