VRR-QA: Visual Relational Reasoning in Videos Beyond Explicit Cues

Sirnam Swetha; Rohit Gupta; Parth Parag Kulkarni; David G Shatwell; Jeffrey A Chan Santiago; Nyle Siddiqui; Joseph Fioresi; Mubarak Shah

arXiv:2506.21742·cs.CV·March 31, 2026

TL;DR

VRR-QA is a new VideoQA benchmark focused on implicit reasoning in creative videos. It shows that current models struggle to reason beyond explicit visual cues.

Contribution

The paper presents VRR-QA, a novel dataset and framework for evaluating visual relational reasoning beyond explicit cues in videos.

Findings

01. Models perform significantly worse than human baselines on VRR-QA.

02. Even the top models achieve only 64% accuracy, which indicates how difficult the benchmark is.

03. Performance varies across models, highlighting diverse reasoning challenges.

Abstract

Video Question Answering (VideoQA) has made significant strides by leveraging multimodal learning to align visual and textual modalities. However, current benchmarks overwhelmingly focus on questions answerable through explicit visual content - actions, objects, and events - directly observable within individual frames or short clips. To truly understand videos as humans do, models must go beyond what is directly shown, inferring hidden relationships and contextual cues that are only implied across frames. Current benchmarks fail to capture this essential aspect of video understanding. To address this gap, we introduce VRR-QA, a benchmark for Visual Relational Reasoning Beyond Explicit Cues. We curate our benchmark from creative and cinematic videos, such as movies, which deliberately employ storytelling techniques that omit direct depictions of certain events or relations, requiring…

Code & Models

Datasets

ucf-crcv/ImplicitQA
dataset · 133 downloads
