On the Consistency of Video Large Language Models in Temporal Comprehension

Minjoon Jung; Junbin Xiao; Byoung-Tak Zhang; Angela Yao

arXiv:2411.12951·cs.CV·March 18, 2025

TL;DR

This paper investigates the prediction consistency of Video Large Language Models in temporal comprehension, revealing their sensitivity to variations and proposing a new tuning method to improve robustness and reliability.

Contribution

The study highlights the inconsistency issues in current Video-LLMs and introduces event temporal verification tuning to enhance their temporal grounding consistency.

Findings

01. Current Video-LLMs are sensitive to content and query variations.

02. Prompting and instruction-tuning methods show unstable improvements.

03. Event temporal verification tuning significantly improves grounding and consistency.

Abstract

Video large language models (Video-LLMs) can temporally ground language queries and retrieve video moments. Yet, such temporal comprehension capabilities are neither well-studied nor understood. So we conduct a study on prediction consistency -- a key indicator for robustness and trustworthiness of temporal grounding. After the model identifies an initial moment within the video content, we apply a series of probes to check if the model's responses align with this initial grounding as an indicator of reliable comprehension. Our results reveal that current Video-LLMs are sensitive to variations in video contents, language queries, and task settings, unveiling severe deficiencies in maintaining consistency. We further explore common prompting and instruction-tuning methods as potential solutions, but find that their improvements are often unstable. To that end, we propose event temporal…
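The consistency check described in the abstract can be illustrated with a minimal sketch. It assumes that each prediction is a (start, end) span in seconds and that a probe response counts as "aligned" when its temporal IoU with the initial grounding meets a threshold. The function names `temporal_iou` and `consistency_rate`, the span representation, and the 0.5 threshold are illustrative assumptions, not the metrics or code defined in the paper.

```python
from typing import List, Tuple

# Hypothetical representation: a predicted moment as (start_sec, end_sec).
Span = Tuple[float, float]


def temporal_iou(a: Span, b: Span) -> float:
    """Intersection-over-union of two time spans on the video timeline."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def consistency_rate(initial: Span, probes: List[Span], threshold: float = 0.5) -> float:
    """Fraction of probe predictions that overlap the initial grounding.

    `threshold` is an assumed cutoff for treating two spans as the same moment.
    """
    if not probes:
        return 0.0
    hits = sum(temporal_iou(initial, p) >= threshold for p in probes)
    return hits / len(probes)


if __name__ == "__main__":
    initial = (10.0, 20.0)  # moment grounded for the original query
    # Moments predicted under probes such as a rephrased query (made-up values).
    probes = [(10.0, 20.0), (12.0, 22.0), (40.0, 50.0)]
    print(consistency_rate(initial, probes))
```

In this toy example two of the three probe predictions overlap the initial moment enough to count as aligned, so a model producing them would be judged only partly consistent under these assumptions.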

Code & Models

Repositories

minjoong507/consistency-of-video-llm (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling

Methods: ALIGN