VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation
Wentao Ma, Weiming Ren, Yiming Jia, Zhuofeng Li, Ping Nie, Ge Zhang, Wenhu Chen

TL;DR
VideoEval-Pro introduces a realistic benchmark for long video understanding that emphasizes open-ended questions, revealing limitations of current models and benchmarks, and providing a more faithful assessment of their true capabilities.
Contribution
The paper proposes VideoEval-Pro, a new benchmark of open-ended questions that better evaluates long video understanding and addresses flaws in existing multiple-choice benchmarks.
Findings
Video LMMs perform significantly worse on open-ended questions than on the corresponding multiple-choice questions.
Higher MCQ scores do not correlate with open-ended question performance.
Increasing input frames improves performance more on VideoEval-Pro than on other benchmarks.
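One way to quantify whether MCQ scores track open-ended scores is a rank correlation across models. The sketch below implements Spearman's rank correlation in plain Python; it is an illustration of the idea, not the analysis used in the paper, and the function names and the example scores are hypothetical placeholders rather than reported results.

```python
from typing import Sequence


def ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Placeholder per-model scores (NOT numbers from the paper).
mcq_scores = [70.0, 65.0, 60.0, 55.0]
open_ended_scores = [30.0, 35.0, 20.0, 25.0]
rho = spearman(mcq_scores, open_ended_scores)
```

A value of `rho` near 1 would mean the two formats rank models the same way; values near 0 would support the finding that MCQ rankings say little about open-ended rankings.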
Abstract
Large multimodal models (LMMs) have recently emerged as a powerful tool for long video understanding (LVU), prompting the development of standardized LVU benchmarks to evaluate their performance. However, our investigation reveals a rather sober lesson for existing LVU benchmarks. First, most existing benchmarks rely heavily on multiple-choice questions (MCQs), whose evaluation results are inflated due to the possibility of guessing the correct answer. Second, a significant portion of questions in these benchmarks have strong priors to allow models to answer directly without even reading the input video. For example, Gemini-1.5-Pro can achieve over 50% accuracy given a random frame from a long video on Video-MME. We also observe that increasing the number of frames does not necessarily lead to improvement on existing benchmarks, which is counterintuitive. As a result, the validity and…
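The guessing effect mentioned in the abstract can be illustrated with the standard correction-for-guessing model. The sketch below assumes a simplified setting that is not taken from the paper: a model either knows the answer or picks uniformly at random among the options, and the four-option count in the example is hypothetical. It does not capture the separate effect of strong question priors.

```python
def expected_mcq_accuracy(known_fraction: float, num_options: int) -> float:
    """Accuracy when unknown questions are answered by uniform random guessing."""
    if not 0.0 <= known_fraction <= 1.0:
        raise ValueError("known_fraction must be in [0, 1]")
    if num_options < 2:
        raise ValueError("an MCQ needs at least two options")
    return known_fraction + (1.0 - known_fraction) / num_options


def guess_corrected_accuracy(observed_accuracy: float, num_options: int) -> float:
    """Invert the model above: estimate the fraction of questions truly known."""
    if num_options < 2:
        raise ValueError("an MCQ needs at least two options")
    chance = 1.0 / num_options
    return (observed_accuracy - chance) / (1.0 - chance)


# Hypothetical example: a model that truly knows 40% of four-option questions.
observed = expected_mcq_accuracy(0.4, 4)
recovered = guess_corrected_accuracy(observed, 4)
```

Under this model, a system that knows nothing still scores at chance level on four-option questions, whereas an open-ended format offers no such floor.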
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
• Clear problem diagnosis. MCQ guessability + strong priors can inflate LVU scores; the benchmark explicitly targets these issues.
• Open-ended reformulation. Short answers (avg 2.1 words) reduce option-structure shortcuts and force models to rely on video content.
• Thoughtful filtering. Multi-stage filters (duration ≥10 min, concise answers, answerability checks, one-frame difficulty screen) aim to remove trivial/option-dependent items.
• Task taxonomy. Separation into Local vs Holistic and
1. **Limited novelty in data:** The work does not introduce new videos or fresh human annotations; it repackages prior benchmarks by dropping distractors and keeping the correct option as the gold. The motivation for “MCQ → open-ended” is insufficiently argued: beyond showing a drop in accuracy, the paper does not establish that the open-ended format (with 2-word answers and an LLM judge) is a more faithful measure of LVU rather than a stricter or noisier one. A clearer goal/assumption–evidence
The motivation to filter out existing benchmarks is good, as is the use of open-ended questions instead of multiple-choice ones. However, I’m sorry to say that I really cannot find enough strengths in this paper.
First, the two key points of this benchmark—robustness and realism—are not clearly justified in the paper. For example, it is unclear why this benchmark is considered more robust. Is it because of the performance drop caused by changing multiple-choice questions (MCQs) into open-ended ones? Second, the main diagram only shows that “VIDEOEVAL-PRO cannot be effectively solved with a single input frame, and performance scales consistently with more frames.” However, the authors should compare thei
1. The paper is well motivated and well written. In particular, the question of open-ended generation vs. MCQ-based evaluation is timely and interesting to study.
2. The paper conducts comprehensive core evaluation experiments with many different open-source and proprietary models.
While I like the motivation, I think the paper in its current form falls short of establishing the core usefulness of the proposed benchmark, and of improving our understanding of why models struggle more with open-ended generation than with MCQ-based evaluation. Please see the Questions for more weaknesses, details, and specific questions.
1) By converting items from four established LVU sets into open-ended, short-answer prompts and filtering out single-frame-solvable questions, the benchmark squarely tests temporal evidence use rather than option-elimination heuristics. The head-to-head comparison (same items: MCQ vs open-ended) shows >25-point drops and rank inversions, which provides great insights.
2) The multi-stage curation (remove short videos, drop low-answerability and prior-driven items) produces a set of 1,289 QA over
1) Section 3.3 of the paper states that frames are “uniformly sampled” at a fixed count for each model evaluation, but it does not analyze whether this sampling strategy is optimal or fair across heterogeneous video types. For example, action-heavy videos might require denser temporal sampling to capture relevant cues, while dialogue-heavy or static-scene videos might be well represented by sparser frames emphasizing semantic rather than motion information. The paper’s results (e.g., the monotonic fra
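For reference, uniform sampling at a fixed frame count, as discussed in the comment above, might look like the sketch below. The paper's exact sampling code is not shown here, so the function name and the choice to take the midpoint of each equal-length segment are assumptions made for illustration.

```python
def uniform_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick frame indices spread evenly over a video.

    The video is split into `num_samples` equal segments and the middle
    frame of each segment is chosen (an assumed convention).
    """
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    # Never request more frames than the video contains.
    num_samples = min(num_samples, total_frames)
    segment = total_frames / num_samples
    return [int(segment * (i + 0.5)) for i in range(num_samples)]
```

Because the indices depend only on the frame count, this strategy treats action-heavy and static videos identically, which is the property the reviewer questions.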
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition
