ViRectify: A Challenging Benchmark for Video Reasoning Correction with Multimodal Large Language Models

Xusen Hei; Jiali Chen; Jinyu Yang; Mengchen Zhao; Yi Cai

arXiv:2512.01424 · cs.CV · December 5, 2025

Open Access

TL;DR

ViRectify is a new benchmark designed to evaluate and improve multimodal large language models' ability to identify and correct complex video reasoning errors through step-wise error correction and evidence grounding.

Contribution

We introduce ViRectify, a comprehensive dataset and correction framework that enables detailed evaluation and enhancement of MLLMs' video reasoning correction capabilities.

Findings

01. GPT-5 achieves 31.94% correction accuracy on ViRectify.

02. Qwen2.5-VL-7B outperforms 72B variants on the benchmark.

03. The framework reveals systematic asymmetries in error correction across models.

Abstract

As multimodal large language models (MLLMs) frequently exhibit errors in complex video reasoning scenarios, correcting these errors is critical for uncovering their weaknesses and improving performance. However, existing benchmarks lack systematic evaluation of MLLMs' ability to identify and correct these video reasoning errors. To bridge this gap, we propose ViRectify, a comprehensive benchmark for evaluating the fine-grained correction capability of MLLMs. Through an AI-assisted annotation pipeline with human verification, we construct a dataset of over 30K instances spanning dynamic perception, scientific reasoning, and embodied decision-making domains. In ViRectify, we challenge MLLMs to perform step-wise error identification and generate rationales with key video evidence grounding. We further propose the trajectory evidence-driven correction framework, comprising step-wise…

Taxonomy

Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Generative Adversarial Networks and Image Synthesis