VRR-QA: Visual Relational Reasoning in Videos Beyond Explicit Cues
Sirnam Swetha, Rohit Gupta, Parth Parag Kulkarni, David G Shatwell, Jeffrey A Chan Santiago, Nyle Siddiqui, Joseph Fioresi, Mubarak Shah

TL;DR
VRR-QA is a new VideoQA benchmark focused on implicit reasoning in creative videos; it shows that current models struggle to reason beyond explicit visual cues.
Contribution
The paper presents VRR-QA, a novel dataset and framework for evaluating visual relational reasoning beyond explicit cues in videos.
Findings
Models perform significantly worse on VRR-QA than human baselines.
Even the top models achieve only 64% accuracy, indicating how difficult the benchmark is.
Performance varies across models, highlighting diverse reasoning challenges.
Abstract
Video Question Answering (VideoQA) has made significant strides by leveraging multimodal learning to align visual and textual modalities. However, current benchmarks overwhelmingly focus on questions answerable through explicit visual content - actions, objects, and events - directly observable within individual frames or short clips. To truly understand videos as humans do, models must go beyond what is directly shown, inferring hidden relationships and contextual cues that are only implied across frames. Current benchmarks fail to capture this essential aspect of video understanding. To address this gap, we introduce VRR-QA, a benchmark for Visual Relational Reasoning Beyond Explicit Cues. We curate our benchmark from creative and cinematic videos, such as movies, that deliberately employ storytelling techniques that omit direct depictions of certain events or relations, requiring…
