Two Causally Related Needles in a Video Haystack
Miaoyu Li, Qin Chao, Boyang Li

TL;DR
This paper introduces Causal2Needles, a new benchmark for evaluating long-video understanding in Video-Language Models, focusing on extracting and relating two separate pieces of information and modeling cause-effect relationships.
Contribution
The paper presents a benchmark with three question types (noncausal one-needle, causal one-needle, and causal two-needle) that test whether models can understand long videos and reason about cause and effect. The results reveal limitations of current models.
Findings
Models perform poorly on causal 2-needle questions.
Performance decreases as the distance between needles increases.
Current VLMs struggle with causal and long-context understanding.
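The distance finding can be illustrated with a small analysis sketch. This is not the authors' evaluation code: the `Result` record, its field names, and the bucket size are all hypothetical, and it assumes each causal two-needle question is tagged with the clip indices of its cause and effect events. It simply groups per-question correctness by the gap between the two needles.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Result:
    # Hypothetical record for one causal two-needle question.
    cause_idx: int    # assumed: index of the clip containing the cause event
    effect_idx: int   # assumed: index of the clip containing the effect event
    correct: bool     # whether the model answered the question correctly


def accuracy_by_distance(results, bucket_size=10):
    """Return {bucket: accuracy}, where bucket = needle distance // bucket_size."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in results:
        bucket = abs(r.effect_idx - r.cause_idx) // bucket_size
        totals[bucket] += 1
        hits[bucket] += int(r.correct)
    return {b: hits[b] / totals[b] for b in sorted(totals)}


# Made-up toy records for illustration only; these are not results from the paper.
toy = [
    Result(0, 5, True),
    Result(2, 8, False),
    Result(0, 25, False),
    Result(10, 31, False),
]
print(accuracy_by_distance(toy))
```

Plotting the returned accuracies against bucket index would show whether accuracy falls as the needles move further apart, which is the trend the finding describes.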
Abstract
Properly evaluating the ability of Video-Language Models (VLMs) to understand long videos remains a challenge. We propose a long-context video understanding benchmark, Causal2Needles, that assesses two crucial abilities insufficiently addressed by existing benchmarks: (1) extracting information from two separate locations (two needles) in a long video and understanding them jointly, and (2) modeling the world in terms of cause and effect in human behaviors. Causal2Needles evaluates these abilities using noncausal one-needle, causal one-needle, and causal two-needle questions. The most complex question type, the causal two-needle question, requires extracting information from both the cause and effect events from a long video and the associated narration text. To prevent textual bias, we introduce two complementary question formats: locating the video clip containing the answer, and verbal…
Taxonomy
Topics: Philosophy and History of Science
Methods: Contrastive Language-Image Pre-training
