LLM4VG: Large Language Models Evaluation for Video Grounding
Wei Feng, Xin Wang, Hong Chen, Zeyang Zhang, Houlun Chen, Zihan Song, Yuwei Zhou, Yuekui Yang, Haiyang Wu, Wenwu Zhu

TL;DR
This paper introduces LLM4VG, a benchmark that evaluates how well large language models perform video grounding. It exposes the limitations of current models and shows the potential of combining LLMs with visual description models.
Contribution
The paper proposes a systematic benchmark for video grounding with LLMs, designs prompt methods for integrating visual descriptions, and provides a comprehensive experimental analysis of model performance.
Findings
Existing VidLLMs perform poorly on video grounding tasks.
Combining LLMs with visual models shows promising potential.
Prompt design significantly influences model performance.
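To make the second and third findings concrete, the sketch below shows one way an LLM could be paired with a captioning model: timestamped captions are packed into a text prompt, and the reply is parsed into a (start, end) span. This is an illustrative assumption, not the paper's prompt design. The function names `build_grounding_prompt` and `parse_span`, the prompt wording, and the caption format are all hypothetical.

```python
import re


def build_grounding_prompt(captions, query):
    """Build a text prompt from timestamped captions (hypothetical format).

    captions: list of (timestamp_seconds, caption_text) pairs, e.g. produced
    by an image or video captioning model run on sampled frames or clips.
    """
    lines = [f"{t:.1f}s: {text}" for t, text in captions]
    return (
        "Below are descriptions of a video sampled at the listed timestamps.\n"
        + "\n".join(lines)
        + f"\n\nQuery: {query}\n"
        + "Answer with the start and end time in seconds, "
        + "e.g. 'start: 3.0, end: 7.5'."
    )


def parse_span(answer):
    """Read the first two numbers in a reply as a (start, end) span.

    Returns None when fewer than two numbers are present, so callers can
    count unparseable replies separately instead of guessing a span.
    """
    nums = re.findall(r"\d+(?:\.\d+)?", answer)
    if len(nums) < 2:
        return None
    start, end = float(nums[0]), float(nums[1])
    return (min(start, end), max(start, end))
```

Because the reply format is set entirely by the prompt, small wording changes can alter how often `parse_span` succeeds, which is one plausible route by which prompt design affects measured performance.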
Abstract
Recently, researchers have attempted to investigate the capability of LLMs in handling videos and proposed several video LLM models. However, the ability of LLMs to handle video grounding (VG), which is an important time-related video task requiring the model to precisely locate the start and end timestamps of temporal moments in videos that match the given textual queries, still remains unclear and unexplored in literature. To fill the gap, in this paper, we propose the LLM4VG benchmark, which systematically evaluates the performance of different LLMs on video grounding tasks. Based on our proposed LLM4VG, we design extensive experiments to examine two groups of video LLM models on video grounding: (i) the video LLMs trained on the text-video pairs (denoted as VidLLM), and (ii) the LLMs combined with pretrained visual description models such as the video/image captioning model. We…
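Since video grounding asks for a start and end timestamp, predictions are commonly scored by temporal overlap with the annotated span. The visible abstract does not name the metric the benchmark uses, so the sketch below assumes the temporal intersection-over-union (IoU) and recall-at-threshold measures that are widely used for this task. The function names are hypothetical.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) spans in seconds."""
    ps, pe = pred
    gs, ge = gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def recall_at_iou(preds, gts, threshold=0.5):
    """Fraction of queries whose predicted span reaches the IoU threshold."""
    if not gts:
        return 0.0
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)
```

For example, a prediction of (2.0, 6.0) against a ground truth of (4.0, 8.0) overlaps for 2 seconds out of a 6-second union, giving an IoU of 1/3, which would count as a miss at a 0.5 threshold.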
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Video Analysis and Summarization
