TempCompass: Do Video LLMs Really Understand Videos?
Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, Lu Hou

TL;DR
The paper introduces TempCompass, a comprehensive benchmark for evaluating the temporal perception abilities of Video LLMs across diverse tasks and aspects, revealing their significant limitations in understanding video dynamics.
Contribution
It proposes a novel benchmark with diverse temporal aspects and task formats, along with innovative data collection and evaluation strategies for assessing Video LLMs.
Findings
State-of-the-art Video LLMs perform poorly on temporal perception tasks.
Most existing benchmarks cannot distinguish between temporal aspects (e.g., speed, direction), so they miss per-aspect differences in model performance.
Evaluation shows significant room for improvement in Video LLM temporal understanding.
Abstract
Recently, there has been a surge of interest in video large language models (Video LLMs). However, existing benchmarks fail to provide comprehensive feedback on the temporal perception ability of Video LLMs. On the one hand, most of them are unable to distinguish between different temporal aspects (e.g., speed, direction) and thus cannot reflect the nuanced performance on these specific aspects. On the other hand, they are limited in the diversity of task formats (e.g., only multi-choice QA), which hinders the understanding of how temporal perception performance may vary across different types of tasks. Motivated by these two problems, we propose the TempCompass benchmark, which introduces a diversity of temporal aspects and task formats. To collect high-quality test data, we devise two novel strategies: (1) In video collection, we construct conflicting videos that share…
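To make the motivation concrete, the following is a minimal sketch of why a single aggregate score can hide weaknesses that a breakdown by temporal aspect and task format exposes. It is not the paper's evaluation code. The record layout, the `accuracy_breakdown` helper, the "yes/no" format label, and all data values are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def accuracy_breakdown(records):
    """Map each (aspect, task_format) pair to its accuracy.

    Each record is a hypothetical tuple: (aspect, task_format, is_correct).
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for aspect, task_format, is_correct in records:
        key = (aspect, task_format)
        totals[key] += 1
        correct[key] += int(is_correct)
    return {key: correct[key] / totals[key] for key in totals}

# Illustrative, made-up results for one model (not taken from the paper).
records = [
    ("speed", "multi-choice", True),
    ("speed", "multi-choice", True),
    ("direction", "multi-choice", False),
    ("direction", "multi-choice", False),
    ("speed", "yes/no", True),
    ("direction", "yes/no", True),
]

# A single aggregate number over all records.
overall = sum(is_correct for _, _, is_correct in records) / len(records)

# The same results split by aspect and task format.
per_cell = accuracy_breakdown(records)

print(f"overall accuracy: {overall:.3f}")
for (aspect, task_format), acc in sorted(per_cell.items()):
    print(f"{aspect:<10} {task_format:<13} {acc:.2f}")
```

In this toy data the overall accuracy looks moderate, while the breakdown shows that every error falls in one cell (direction, multi-choice). That is the kind of per-aspect, per-format signal the abstract says existing benchmarks cannot provide.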
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization
