Evaluating Cross-Modal Reasoning Ability and Problem Characteristics with Multimodal Item Response Theory
Shunki Uebayashi, Kento Masui, Kyohei Atarashi, Han Bao, Hisashi Kashima, Naoto Inoue, Mayu Otani, and Koh Takeuchi

TL;DR
This paper introduces M3IRT, a new framework for evaluating multimodal models that accurately measures their cross-modal reasoning ability by filtering out shortcut questions, leading to more reliable benchmarks.
Contribution
M3IRT extends classical IRT to decompose model ability and item difficulty into modality-specific components, improving the quality and efficiency of multimodal benchmark evaluations.
Findings
M3IRT effectively identifies genuinely cross-modal questions.
It maintains ranking fidelity even with 50% low-quality items.
It reduces evaluation cost while improving reliability.
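One way to make "ranking fidelity" concrete is a rank correlation between the model ranking obtained on a clean benchmark and the ranking obtained on a contaminated one. The sketch below uses Spearman's rank correlation as an assumed metric; the summary above does not state which measure the paper uses, and the function names and example scores are hypothetical.

```python
import math


def average_ranks(scores):
    """Return 1-based ranks, assigning tied scores their average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of scores tied with position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)


# Hypothetical per-model scores on a clean and a contaminated benchmark.
clean_scores = [0.9, 0.7, 0.5, 0.3]
noisy_scores = [0.8, 0.75, 0.4, 0.35]
fidelity = spearman(clean_scores, noisy_scores)
```

A value near 1 means the contaminated benchmark orders the models the same way as the clean one; lower values indicate that low-quality items have distorted the ranking.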
Abstract
Multimodal Large Language Models (MLLMs) have recently emerged as general architectures capable of reasoning over diverse modalities. Benchmarks for MLLMs should measure their ability for cross-modal integration. However, current benchmarks are filled with shortcut questions, which can be solved using only a single modality, thereby yielding unreliable rankings. For example, in vision-language cases, the correct answer can be found without the image or without the text. These low-quality questions unnecessarily increase the size and computational requirements of benchmarks. We introduce a multi-modal and multidimensional item response theory framework (M3IRT) that extends classical IRT by decomposing both model ability and item difficulty into image-only, text-only, and cross-modal components. M3IRT estimates the cross-modal ability of MLLMs and each question's cross-modal difficulty,…
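The decomposition described in the abstract can be illustrated with a minimal sketch. It assumes a logistic (2PL-style) response model in which ability and difficulty each split additively into image-only, text-only, and cross-modal components; this is an illustrative reading, not the paper's exact parameterization. The `Components` class, the `p_correct` function, and the `discrimination` parameter are hypothetical names introduced here.

```python
import math
from dataclasses import dataclass


@dataclass
class Components:
    """Modality-specific components (hypothetical container)."""
    image: float
    text: float
    cross: float


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def p_correct(ability: Components, difficulty: Components,
              discrimination: float = 1.0) -> float:
    """Probability that a model answers an item correctly.

    Assumption: the logit is the sum over components of
    (ability - difficulty), scaled by an item discrimination.
    """
    margin = (
        (ability.image - difficulty.image)
        + (ability.text - difficulty.text)
        + (ability.cross - difficulty.cross)
    )
    return sigmoid(discrimination * margin)


# A model with strong cross-modal ability on an item whose difficulty
# is mostly cross-modal (all numbers are made up for illustration).
model = Components(image=0.2, text=0.1, cross=1.5)
item = Components(image=0.0, text=0.0, cross=1.0)
prob = p_correct(model, item)
```

Under this reading, an item whose difficulty sits mostly in the image-only or text-only component behaves like a shortcut question, while an item with a large cross-modal component separates models by their cross-modal ability.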
Peer Reviews
Decision: ICLR 2026 Poster
1. Both the motivation of decomposing model ability and item difficulty and the proposed M^3-IRT framework make sense and are technically sound to me. Even though it builds upon the classical IRT framework, the extension to MLLMs makes lots of sense and could provide more nuanced evaluation and improved reliability with marginal cost. 2. Extensive experiments were conducted to justify the effectiveness and efficiency of the proposed framework. 3. Writing is good and easy to follow.
1. Multiple hyper-parameters were introduced by the proposed model. It would be better to provide more discussion on: a. how accurate is the estimation of these parameters based on the method in Section 4.4? b. the sensitivity of the hyper-parameters; c. how much data is needed? d. the cost of the estimation. 2. The proposed framework mainly filters the items to be tested rather than helping curate new datasets. While evaluation might be costly, we only need to run it once for every new model. With the loweri
1. The paper makes a contribution by addressing the critical challenge of "shortcut questions" in multimodal benchmarks. It provides a systematic solution that enhances the reliability of evaluations while simultaneously reducing computational costs. 2. The paper effectively targets two major pain points in current multimodal evaluation: unreliable model rankings caused by shortcut question contamination, and the high computational cost associated with large-scale benchmarks. The proposed M³-IRT
1. The model's core assumption of linear decomposition for abilities and difficulties might oversimplify the complex, potentially non-linear interactions that occur during cross-modal reasoning. 2. The interpretability of the estimated parameters such as cross-modal difficulty is derived purely from model performance patterns and lacks external validation against human cognitive judgments of what constitutes a cross-modal task. 3. It is noted that the paper generates artificial low-quality ques
1. This paper studies a very important problem in Multimodal Large Language Models, and the authors' observations about the weaknesses of existing evaluation efforts are reasonable. 2. This paper provides detailed descriptions of their evaluation framework. There are a lot of figures to visualize the results of evaluation.
1. The problem that this paper has pointed out (prior evaluation efforts use multimodal problems that can be solved with only one of the modalities) has been studied previously. This paper did not provide proper reference to these prior works. For example, MMEvalPro [1] is a recently proposed dataset with manually labeled questions to mitigate the problem. 2. The results are mostly displayed in the figures, while I expect more accurate numbers to be displayed in tables. While figures are effecti
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Intelligent Tutoring Systems and Adaptive Learning
