Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?

Junhao Cheng; Yuying Ge; Teng Wang; Yixiao Ge; Jing Liao; Ying Shan

arXiv:2505.21374 · cs.CV · May 28, 2025

TL;DR

Video-Holmes introduces a new benchmark inspired by Sherlock Holmes to evaluate complex video reasoning in multimodal models, revealing current models' struggles with integrating clues and reasoning like humans.

Contribution

The paper presents Video-Holmes, a novel benchmark with 1,837 questions from suspense films, designed to assess deep reasoning and clue integration in multimodal models.

Findings

01

State-of-the-art models perform poorly on complex reasoning tasks.

02

Most models excel at perception but struggle with clue integration.

03

The best model achieves only 45% accuracy.
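
The accuracy figures above are the kind of number a multiple-choice benchmark yields by comparing each predicted option with the annotated answer. The following is a minimal sketch of that computation, not the authors' evaluation script: the record fields (`question_type`, `answer`, `prediction`) and the two category names are hypothetical, and the toy records are invented for illustration.

```python
from collections import defaultdict


def accuracy_by_type(records):
    """Return (overall_accuracy, {question_type: accuracy}).

    Each record is a dict with hypothetical keys:
    'question_type', 'answer', and 'prediction' (option letters).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        qtype = r["question_type"]
        total[qtype] += 1
        # Compare option letters case-insensitively, ignoring stray whitespace.
        if r["prediction"].strip().upper() == r["answer"].strip().upper():
            correct[qtype] += 1
    per_type = {t: correct[t] / total[t] for t in total}
    n = sum(total.values())
    overall = sum(correct.values()) / n if n else 0.0
    return overall, per_type


# Toy records, invented for illustration (not benchmark data).
records = [
    {"question_type": "perception", "answer": "A", "prediction": "A"},
    {"question_type": "perception", "answer": "B", "prediction": "b"},
    {"question_type": "clue_integration", "answer": "C", "prediction": "A"},
    {"question_type": "clue_integration", "answer": "D", "prediction": "D"},
]

overall, per_type = accuracy_by_type(records)
print(overall, per_type)
```

Reporting accuracy per question type alongside the overall score is what makes a gap between perception and clue integration visible, rather than averaging it away.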

Abstract

Recent advances in CoT reasoning and RL post-training have been reported to enhance video reasoning capabilities of MLLMs. This progress naturally raises a question: can these models perform complex video reasoning in a manner comparable to human experts? However, existing video benchmarks primarily evaluate visual perception and grounding abilities, with questions that can be answered based on explicit prompts or isolated visual cues. Such benchmarks do not fully capture the intricacies of real-world reasoning, where humans must actively search for, integrate, and analyze multiple clues before reaching a conclusion. To address this issue, we present Video-Holmes, a benchmark inspired by the reasoning process of Sherlock Holmes, designed to evaluate the complex video reasoning capabilities of MLLMs. Video-Holmes consists of 1,837 questions derived from 270 manually annotated suspense…

Code & Models

Repositories

tencentarc/video-holmes
PyTorch · Official

Datasets

TencentARC/Video-Holmes
dataset · 546 downloads

Taxonomy

Topics: Explainable Artificial Intelligence (XAI)