InfiniBench: A Benchmark for Large Multi-Modal Models in Long-Form Movies and TV Shows
Kirolos Ataallah, Eslam Abdelrahman, Mahmoud Ahmed, Chenhui Gou, Khushbu Pahwa, Jian Ding, Mohamed Elhoseiny

TL;DR
InfiniBench is a comprehensive benchmark with extensive long-form video content and diverse questions, designed to evaluate and challenge multi-modal models' ability to understand complex, narratively rich videos.
Contribution
This paper introduces InfiniBench, the largest and most diverse benchmark for long video understanding, including extensive content, questions, and skills, to rigorously evaluate multi-modal models.
Findings
Models perform poorly on long video comprehension tasks.
Models rely heavily on metadata and world knowledge.
Multimodal input significantly improves model performance.
Abstract
Understanding long-form videos, such as movies and TV episodes ranging from tens of minutes to two hours, remains a significant challenge for multi-modal models. Existing benchmarks often fail to test the full range of cognitive skills needed to process these temporally rich and narratively complex inputs. Therefore, we introduce InfiniBench, a comprehensive benchmark designed to rigorously evaluate the capabilities of models in long video understanding. InfiniBench offers: (1) Over 1,000 hours of video content, with an average video length of 53 minutes. (2) The largest set of question-answer pairs for long video comprehension, totaling around 87.7K. (3) Eight diverse skills that span both grounding-based (e.g., scene transitions, character actions) and reasoning-based (e.g., deep context understanding, multi-event linking). (4) Rich annotation formats, including both multiple-choice…
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis
Methods: Focus
