MME-CoF-Pro: Evaluating Reasoning Coherence in Video Generative Models with Text and Visual Hints

Yu Qi; Xinyi Xu; Ziyu Guo; Siyuan Ma; Renrui Zhang; Xinyan Chen; Ruichuan An; Ruofan Xing; Jiayi Zhang; Haojie Huang; Pheng-Ann Heng; Jonathan Tremblay; Lawson L.S. Wong

arXiv:2603.20194·cs.CV·March 23, 2026

TL;DR

This paper introduces MME-CoF-Pro, a benchmark to evaluate reasoning coherence in video generative models, revealing their weak reasoning abilities and the effects of different hints on their performance.

Contribution

It presents a new benchmark of 303 samples across 16 categories, with a process-level Reasoning Score metric and three evaluation settings (no hint, text hint, visual hint), to assess reasoning coherence in video models, which existing evaluations do not measure.

Findings

01. Video models show weak reasoning coherence, independent of generation quality.

02. Text hints improve correctness but can cause inconsistencies and hallucinations.

03. Visual hints help with perceptual tasks but are less effective for fine-grained reasoning.

Abstract

Video generative models show emerging reasoning behaviors. For reliable deployment, generated events must remain causally consistent across frames, a property we define as reasoning coherence. Because existing evaluations do not measure reasoning coherence, we propose MME-CoF-Pro, a comprehensive video reasoning benchmark for assessing it in video models. Specifically, MME-CoF-Pro contains 303 samples across 16 categories, ranging from visual logical to scientific reasoning. It introduces the Reasoning Score, an evaluation metric that assesses the necessary intermediate reasoning steps at the process level, and includes three evaluation settings, (a) no hint, (b) text hint, and (c) visual hint, enabling a controlled investigation into the underlying mechanisms of reasoning hint guidance. Evaluation results on 7 open- and closed-source video models reveal…
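The abstract names a process-level metric and three hint settings but, in this excerpt, does not give the scoring formula. The following is only a minimal sketch of how such an evaluation could be organized, assuming the score for a sample is the fraction of necessary intermediate steps a judge marks as satisfied. `StepJudgment`, `step_coverage`, and `mean_by_setting` are hypothetical names for illustration, not the paper's Reasoning Score or API.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class HintSetting(Enum):
    # The three evaluation settings named in the abstract.
    NO_HINT = "no hint"
    TEXT_HINT = "text hint"
    VISUAL_HINT = "visual hint"


@dataclass
class StepJudgment:
    # Hypothetical record: one necessary intermediate step and whether
    # a judge found it correctly depicted in the generated video.
    description: str
    satisfied: bool


def step_coverage(judgments: list[StepJudgment]) -> float:
    """Assumed stand-in for a process-level score: the fraction of
    necessary intermediate steps satisfied (0.0 if none are listed)."""
    if not judgments:
        return 0.0
    return sum(j.satisfied for j in judgments) / len(judgments)


def mean_by_setting(
    results: dict[HintSetting, list[list[StepJudgment]]],
) -> dict[HintSetting, float]:
    """Average the per-sample score within each hint setting."""
    return {
        setting: (
            sum(step_coverage(sample) for sample in samples) / len(samples)
            if samples
            else 0.0
        )
        for setting, samples in results.items()
    }
```

Grouping scores by setting keeps the comparison controlled: the same samples can be scored under no hint, text hint, and visual hint, so any difference in the averages is attributable to the hint type rather than to the task mix.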

Taxonomy

Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Human Pose and Action Recognition