$M^3-Verse$: A "Spot the Difference" Challenge for Large Multimodal Models

Kewei Wei; Bocheng Hu; Jie Cao; Xiaohan Chen; Zhengxi Lu; Wubing Xia; Weili Xu; Jiaao Wu; Junchen He; Mingyu Jia; Ciyun Zhao; Ye Sun; Yizhi Li; Zhonghan Zhao; Jian Zhang; Gaoang Wang

arXiv:2512.18735·cs.CV·December 23, 2025

TL;DR

This paper introduces $M^3-Verse$, a comprehensive benchmark for evaluating large multimodal models' ability to understand dynamic object changes across videos, revealing current limitations and proposing a baseline for improvement.

Contribution

The paper presents a new benchmark, $M^3-Verse$, for assessing models' understanding of object transformations in videos, and evaluates existing models, highlighting their shortcomings and proposing a simple baseline.

Findings

01. Existing LMMs struggle with tracking state transitions.

02. $M^3-Verse$ contains 270 scenes and 2,932 questions.

03. A baseline improves multi-state perception performance.

Abstract

Modern Large Multimodal Models (LMMs) have demonstrated extraordinary ability in static image and single-state spatial-temporal understanding. However, their capacity to comprehend the dynamic changes of objects within a shared spatial context between two distinct video observations remains largely unexplored. This ability to reason about transformations within a consistent environment is particularly crucial for advancing spatial intelligence. In this paper, we introduce $M^3-Verse$, a Multi-Modal, Multi-State, Multi-Dimensional benchmark, to formally evaluate this capability. It is built upon paired videos that provide multi-perspective observations of an indoor scene before and after a state change. The benchmark contains a total of 270 scenes and 2,932 questions, which are categorized into over 50 subtasks that probe 4 core capabilities. We evaluate 16…

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Surveillance and Tracking Methods