How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs

Muhammad Uzair Khattak; Muhammad Ferjad Naeem; Jameel Hassan; Muzammal Naseer; Federico Tombari; Fahad Shahbaz Khan; Salman Khan

arXiv:2405.03690·cs.CV·May 10, 2024

Open Access

TL;DR

This paper introduces CVRR-ES, a comprehensive benchmark for evaluating Video-LMMs' reasoning and robustness in complex, real-world videos, revealing current models' limitations and proposing a prompting technique to improve performance.

Contribution

The paper presents a new evaluation suite for Video-LMMs that assesses reasoning and robustness across diverse real-world videos, and introduces a training-free prompting method to enhance existing models.

Findings

01

Most Video-LMMs struggle with robustness and reasoning in complex videos.

02

Open-source models perform worse than closed-source ones in the benchmark.

03

The proposed DSCP technique improves model performance without additional training.

Abstract

Recent advancements in Large Language Models (LLMs) have led to the development of Video Large Multi-modal Models (Video-LMMs) that can handle a wide range of video understanding tasks. These models have the potential to be deployed in real-world applications such as robotics, AI assistants, medical surgery, and autonomous vehicles. The widespread adoption of Video-LMMs in our daily lives underscores the importance of ensuring and evaluating their robust performance in mirroring human-like reasoning and interaction capabilities in complex, real-world contexts. However, existing benchmarks for Video-LMMs primarily focus on general video comprehension abilities and neglect assessing their reasoning capabilities over complex videos in the real-world context, and robustness of these models through the lens of user prompts as text queries. In this paper, we present the Complex Video…

Taxonomy

Topics: Human Pose and Action Recognition · Advanced Vision and Imaging · Machine Learning and Data Classification