On the Consistency of Video Large Language Models in Temporal Comprehension
Minjoon Jung, Junbin Xiao, Byoung-Tak Zhang, Angela Yao

TL;DR
This paper investigates how consistently Video Large Language Models (Video-LLMs) predict in temporal comprehension tasks. It shows that they are sensitive to variations in video content, language queries, and task settings, and proposes a new tuning method to improve robustness and reliability.
Contribution
The study highlights the inconsistency issues in current Video-LLMs and introduces event temporal verification tuning to enhance their temporal grounding consistency.
Findings
Current Video-LLMs are sensitive to variations in video content, language queries, and task settings.
Prompting and instruction-tuning methods show unstable improvements.
Event temporal verification tuning significantly improves grounding and consistency.
Abstract
Video large language models (Video-LLMs) can temporally ground language queries and retrieve video moments. Yet such temporal comprehension capabilities are neither well studied nor well understood. We therefore conduct a study on prediction consistency, a key indicator of robustness and trustworthiness in temporal grounding. After the model identifies an initial moment within the video content, we apply a series of probes to check whether the model's responses align with this initial grounding, as an indicator of reliable comprehension. Our results reveal that current Video-LLMs are sensitive to variations in video contents, language queries, and task settings, unveiling severe deficiencies in maintaining consistency. We further explore common prompting and instruction-tuning methods as potential solutions, but find that their improvements are often unstable. To that end, we propose event temporal…
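The probing idea described in the abstract (ground a moment once, then check whether later responses still agree with it) can be illustrated with a small sketch. This is not the paper's evaluation code or its metric: the function names `temporal_iou` and `consistency_rate`, the use of temporal IoU as the agreement measure, and the 0.5 threshold are all assumptions made for illustration.

```python
from typing import List, Tuple

# A moment is a (start_sec, end_sec) pair. This representation is an assumption.
Moment = Tuple[float, float]


def temporal_iou(a: Moment, b: Moment) -> float:
    """Intersection-over-union of two time intervals, in [0, 1]."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def consistency_rate(initial: Moment, probes: List[Moment],
                     threshold: float = 0.5) -> float:
    """Fraction of probe predictions that overlap the initial grounding.

    `probes` holds the moments a model predicts under hypothetical
    variations (for example, a rephrased query). The threshold is an
    illustrative choice, not a value taken from the paper.
    """
    if not probes:
        return 0.0
    hits = sum(temporal_iou(initial, p) >= threshold for p in probes)
    return hits / len(probes)


if __name__ == "__main__":
    initial_moment = (10.0, 20.0)
    probe_moments = [(10.0, 20.0), (12.0, 22.0), (30.0, 40.0)]
    print(consistency_rate(initial_moment, probe_moments))
```

In this toy input, two of the three probe predictions overlap the initial moment enough to count as consistent, and the third points to a different part of the video. A model that comprehends the video reliably would be expected to score close to 1 under such a measure.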
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling
Methods: ALIGN
