MovieRecapsQA: A Multimodal Open-Ended Video Question-Answering Benchmark

Shaden Shaar; Bradon Thymes; Sirawut Chaixanien; Claire Cardie; Bharath Hariharan

arXiv:2601.02536·cs.CV·April 1, 2026

TL;DR

This paper introduces MovieRecapsQA, a novel open-ended multimodal VideoQA benchmark using movie recap videos to evaluate models' reasoning across visual and dialogue cues.

Contribution

The paper creates the first reference-free, open-ended VideoQA benchmark, with multiple input settings and a detailed categorization of questions by modality.

Findings

1. Reference-free metric aligns well with human judgment.
2. Vision questions are the most challenging for models.
3. Removing visual input can sometimes improve factual accuracy.

Abstract

Understanding real-world videos such as movies requires integrating visual and dialogue cues. Yet existing VideoQA benchmarks struggle to capture this multimodal reasoning and, given the difficulty of evaluating free-form answers, largely resort to simple multiple-choice questions. We introduce a novel open-ended multimodal VideoQA benchmark, MovieRecapsQA, created using movie recap videos (recap video), a distinctive type of YouTube content that summarizes a film via a voiceover description of key clips from the movie. From the transcribed voiceover (recap summary) of 60 recap videos, we generate ≈8.2K questions along with the necessary "facts" expected in each answer; the former facilitates the creation of questions that require multimodal reasoning and the latter allows the construction of a reference-free evaluation metric that can be applied to open-ended responses. To…
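The abstract says each question comes with the "facts" expected in its answer, and that these facts enable a reference-free metric, but the excerpt here does not spell out how the metric is computed. As a rough illustration only, the sketch below assumes the score is the fraction of expected facts that an answer supports. The names `fact_coverage` and `naive_supports` are hypothetical, and the token-overlap check is a stand-in for whatever judge the authors actually use.

```python
import re
from typing import Callable, List, Set


def _tokens(text: str) -> Set[str]:
    """Lowercase alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def naive_supports(answer: str, fact: str, threshold: float = 0.6) -> bool:
    """Hypothetical stand-in judge: a fact counts as supported when at
    least `threshold` of its tokens also appear in the answer."""
    fact_tokens = _tokens(fact)
    if not fact_tokens:
        return False
    overlap = len(fact_tokens & _tokens(answer)) / len(fact_tokens)
    return overlap >= threshold


def fact_coverage(
    answer: str,
    facts: List[str],
    supports: Callable[[str, str], bool] = naive_supports,
) -> float:
    """Assumed scoring rule: fraction of expected facts the answer supports.
    No reference answer is needed, only the list of expected facts."""
    if not facts:
        return 0.0
    return sum(1 for fact in facts if supports(answer, fact)) / len(facts)


# Made-up example data, not taken from the benchmark.
facts = [
    "the detective finds the letter",
    "the letter is hidden in the attic",
]
answer = "The detective finds the letter in a drawer."
score = fact_coverage(answer, facts)  # first fact supported, second not
```

Passing a different `supports` callable (for example, one backed by an entailment model) changes the judge without touching the scoring rule, which is why the two are kept separate in this sketch.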

Code & Models

Datasets

sshaar/movierecapsqa
dataset· 113 dl