STORM: Benchmarking Visual Rating of MLLMs with a Comprehensive Ordinal Regression Dataset
Jinhong Wang, Shuo Tong, Jian Liu, Dongqi Tang, Jintai Chen, Haochao Ying, Hongxia Xu, Danny Chen, Jian Wu

TL;DR
This paper introduces STORM, a comprehensive benchmark with datasets and evaluation protocols for assessing multi-modal large language models' ability to perform visual rating tasks across various domains, emphasizing interpretability and zero-shot capabilities.
Contribution
The work provides the first large-scale, multi-domain ordinal regression benchmark for MLLMs, along with a novel processing pipeline for trustworthy and interpretable visual rating.
Findings
MLLMs show significant room for improvement in visual rating tasks.
The proposed coarse-to-fine pipeline enhances interpretability and performance.
Fine-tuning strategies can significantly boost MLLMs' zero-shot capabilities.
Abstract
Visual rating is an essential capability of artificial intelligence (AI) for multi-dimensional quantification of visual content, primarily applied in ordinal regression (OR) tasks such as image quality assessment, facial age estimation, and medical image grading. However, current multi-modal large language models (MLLMs) underperform at visual rating, and the field also suffers from a lack of relevant datasets and benchmarks. In this work, we collect and present STORM, a data collection and benchmark for Stimulating Trustworthy Ordinal Regression Ability of MLLMs for universal visual rating. STORM encompasses 14 ordinal regression datasets across five common visual rating domains, comprising 655K image-level pairs and the corresponding carefully curated VQAs. Importantly, we also propose a coarse-to-fine processing pipeline that dynamically considers label candidates and provides…
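The coarse-to-fine idea mentioned in the abstract can be illustrated with a small Python sketch. This is not the authors' pipeline, whose details are truncated above. It only assumes that ordered fine-grained labels are grouped into contiguous coarse bins, so that a coarse decision narrows the set of fine label candidates. The names `CoarseBin`, `build_bins`, and `coarse_to_fine_target`, and the equal-width binning rule, are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class CoarseBin:
    """A coarse rating level and the ordered fine labels it covers (hypothetical structure)."""
    name: str
    fine_labels: Tuple[int, ...]


def build_bins(levels: Sequence[int], coarse_names: Sequence[str]) -> List[CoarseBin]:
    """Split ordered fine levels into contiguous, roughly equal coarse bins.

    Assumption: equal-width binning; the paper may choose candidates differently.
    """
    n, k = len(levels), len(coarse_names)
    if k == 0 or n < k:
        raise ValueError("need at least one fine level per coarse bin")
    bins = []
    for i, name in enumerate(coarse_names):
        lo = i * n // k          # first index of this bin
        hi = (i + 1) * n // k    # one past the last index of this bin
        bins.append(CoarseBin(name, tuple(levels[lo:hi])))
    return bins


def coarse_to_fine_target(label: int, bins: Sequence[CoarseBin]) -> Dict[str, object]:
    """Return the coarse level, the narrowed candidate set, and the fine label."""
    for b in bins:
        if label in b.fine_labels:
            return {"coarse": b.name, "candidates": list(b.fine_labels), "fine": label}
    raise ValueError(f"label {label} is not covered by any coarse bin")


if __name__ == "__main__":
    # Nine ordered fine levels grouped into three coarse levels.
    bins = build_bins(list(range(1, 10)), ["low", "medium", "high"])
    print(coarse_to_fine_target(5, bins))
```

For the nine-level example, label 5 falls in the "medium" bin with candidates 4, 5, and 6. In this sketch the fine answer is chosen only among the candidates of the predicted coarse bin.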
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. This work gives the first comprehensive benchmark and dataset suite specifically designed to benchmark and boost the universal visual rating capability of MLLMs across 14 ordinal-regression datasets spanning five domains. 2. A coarse-to-fine Chain-of-Thought prompting pipeline that equips MLLMs with an interpretable, hierarchical ordinal-reasoning paradigm, substantially mitigating numerical hallucination and improving zero-shot generalization. 3. The experiments section is convincing, demonstra
1. All textual labels rely on GPT-4 generation supplemented by manual fine-tuning, a process that is neither reproducible nor transparent. I understand the high cost of manual data, but this weakens the data contribution of this paper, making it resemble a "combination of existing datasets." 2. The zero-shot claim is mismatched with the evaluation settings. **The paper emphasizes "zero-shot" capability**, yet in most experiments, it uses the training set of the corresponding dataset for fine-tu
The work constructs a dataset comprising 655,000 image-question pairs, encompassing the domains of image quality, aesthetics, age estimation, medical grading, and historical period estimation. Mainstream datasets in these areas often lack textual labels; thus, the authors employed GPT to generate labels, followed by manual adjustments to ensure that the textual definitions align more closely with human scoring habits.
1. Lack of innovation: although the Coarse-to-Fine Chain-of-Thought approach is practical, this concept is common in IQA. 2. The experiments lack comparisons with SOTA models in the field, instead only comparing with other MLLMs, making it difficult to demonstrate its performance advantages in visual scoring tasks. 3. The work does not provide detailed information on the specific ratios and standards used for the manual adjustments in generating textual labels. 4. The work utilizes Qwen2.5-VL-3
1. The submission proposes a coarse-to-fine method to improve existing MLLMs, and it shows performance gains. 2. The submission is clearly written, well-organized, and easy to follow.
1. STORM is essentially a merger of existing datasets without new annotations, which limits its contribution. The text annotations are not truly new, but just reformatted versions of existing numeric scores, so they do not add meaningful value. Overall, for a dataset-focused paper, the work mainly consists of combining prior datasets, which does not meet the expected level of novelty, effort, or complexity. 2. Despite fine-tuning, performance remains weak. Prior work (Q-Align, DeQA-Score) traine
- The primary contribution lies in providing the first large-scale, multi-domain benchmark for assessing MLLMs’ visual ordinal regression capabilities. The dataset construction is rigorous, covering 655K samples across five distinct domains (Table 1), filling a gap in this line of research. - The experimental design is relatively comprehensive, comparing against multiple baseline models (Tables 2-3), and the detailed ablations (Table 4) substantiate the effectiveness of key design choices, espec
- The paper exhibits two fundamental issues. First, the so-called “coarse-to-fine CoT” is essentially closer to hierarchical supervision rather than genuine chain-of-thought reasoning. As indicated by Figure 2 and the description on page 5, lines 269-277, the model is trained to output both coarse and fine labels, but the paper does not clearly specify whether, at inference time, there exists a true dependency from the coarse prediction to the fine prediction. The setup of the “w/o CoT” baseline
Taxonomy
Topics: Digital Imaging for Blood Diseases · AI in cancer detection · Image Retrieval and Classification Techniques
