InfiniBench: A Benchmark for Large Multi-Modal Models in Long-Form Movies and TV Shows

Kirolos Ataallah; Eslam Abdelrahman; Mahmoud Ahmed; Chenhui Gou; Khushbu Pahwa; Jian Ding; Mohamed Elhoseiny

arXiv:2406.19875 · cs.CV · November 11, 2025

TL;DR

InfiniBench is a comprehensive benchmark with extensive long-form video content and diverse questions, designed to evaluate and challenge multi-modal models' ability to understand complex, narratively rich videos.

Contribution

This paper introduces InfiniBench, the largest and most diverse benchmark for long video understanding, including extensive content, questions, and skills, to rigorously evaluate multi-modal models.

Findings

1. Models perform poorly on long video comprehension tasks.
2. Models rely heavily on metadata and world knowledge.
3. Multimodal input significantly improves model performance.

Abstract

Understanding long-form videos, such as movies and TV episodes ranging from tens of minutes to two hours, remains a significant challenge for multi-modal models. Existing benchmarks often fail to test the full range of cognitive skills needed to process these temporally rich and narratively complex inputs. Therefore, we introduce InfiniBench, a comprehensive benchmark designed to rigorously evaluate the capabilities of models in long video understanding. InfiniBench offers: (1) Over 1,000 hours of video content, with an average video length of 53 minutes. (2) The largest set of question-answer pairs for long video comprehension, totaling around 87.7K. (3) Eight diverse skills that span both grounding-based (e.g., scene transitions, character actions) and reasoning-based (e.g., deep context understanding, multi-event linking) categories. (4) Rich annotation formats, including both multiple-choice…

Code & Models

Repositories

Vision-CAIR/InfiniBench (Official)

Datasets

Vision-CAIR/InfiniBench
dataset · 620 downloads

Videos

InfiniBench: A Benchmark for Large Multi-Modal Models in Long-Form Movies and TV Shows

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis

Methods: Focus