MMGR: Multi-Modal Generative Reasoning

Zefan Cai; Haoyi Qiu; Tianyi Ma; Haozhe Zhao; Gengze Zhou; Kung-Hsiang Huang; Parisa Kordjamshidi; Minjia Zhang; Wen Xiao; Jiuxiang Gu; Nanyun Peng; Junjie Hu

arXiv:2512.14691 · cs.CL · December 18, 2025

Open Access

TL;DR

MMGR introduces a comprehensive evaluation framework for generative video and image models that targets reasoning abilities such as physical, logical, and spatial understanding, and it reveals significant gaps in current models' reasoning capabilities.

Contribution

The paper presents MMGR, a novel benchmarking framework that assesses reasoning skills in generative models across multiple domains, emphasizing holistic correctness beyond perceptual quality.

Findings

01. Current models perform poorly on abstract reasoning tasks.

02. Models show moderate success on physical commonsense tasks.

03. Significant gaps in reasoning abilities highlight the need for improved generative models.

Abstract

Video foundation models generate visually realistic and temporally coherent content, but their reliability as world simulators depends on whether they capture physical, logical, and spatial constraints. Existing metrics such as Frechet Video Distance (FVD) emphasize perceptual quality and overlook reasoning failures, including violations of causality, physics, and global consistency. We introduce MMGR (Multi-Modal Generative Reasoning Evaluation and Benchmark), a principled evaluation framework based on five reasoning abilities: Physical, Logical, 3D Spatial, 2D Spatial, and Temporal. MMGR evaluates generative reasoning across three domains: Abstract Reasoning (ARC-AGI, Sudoku), Embodied Navigation (real-world 3D navigation and localization), and Physical Commonsense (sports and compositional interactions). MMGR applies fine-grained metrics that require holistic correctness across both…
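
The idea of requiring holistic correctness can be illustrated with a minimal sketch. The abstract above is truncated, so the paper's actual metric definitions are not shown here; the criterion names (`physics`, `logic`, `spatial`) and all function names below are hypothetical and chosen only for illustration. The sketch assumes a sample counts as correct only when every fine-grained check passes, and contrasts that with a plain per-criterion pass rate.

```python
from dataclasses import dataclass


@dataclass
class SampleChecks:
    """Pass/fail results for one generated sample (criterion names are hypothetical)."""
    checks: dict[str, bool]


def holistic_correct(sample: SampleChecks) -> bool:
    # Assumption: a sample is holistically correct only if it has at least
    # one check and every check passes.
    return bool(sample.checks) and all(sample.checks.values())


def holistic_accuracy(samples: list[SampleChecks]) -> float:
    # Fraction of samples that pass every criterion at once.
    if not samples:
        raise ValueError("need at least one sample")
    return sum(holistic_correct(s) for s in samples) / len(samples)


def mean_criterion_pass_rate(samples: list[SampleChecks]) -> float:
    # Fraction of individual checks that pass, pooled over all samples.
    total = sum(len(s.checks) for s in samples)
    if total == 0:
        raise ValueError("need at least one check")
    passed = sum(sum(s.checks.values()) for s in samples)
    return passed / total


# Illustrative data only; these are not results from the paper.
samples = [
    SampleChecks({"physics": True, "logic": True, "spatial": True}),
    SampleChecks({"physics": True, "logic": False, "spatial": True}),
    SampleChecks({"physics": True, "logic": True, "spatial": False}),
]
print(holistic_accuracy(samples))
print(mean_criterion_pass_rate(samples))
```

On this toy data, 7 of 9 individual checks pass but only 1 of 3 samples passes all of them, which shows how an all-criteria score can be much stricter than an averaged one.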

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)