TL;DR
This paper introduces VisJudge-Bench, a comprehensive benchmark for evaluating AI models' ability to assess visualization quality and aesthetics, revealing current models' gaps and proposing a specialized model, VisJudge, to improve performance.
Contribution
It presents the first comprehensive benchmark for evaluating how well multimodal large language models (MLLMs) assess visualization quality and aesthetics, along with a new model, VisJudge, that outperforms existing MLLMs at this task.
Findings
MLLMs still lag behind humans in visualization assessment.
VisJudge reduces MAE by 23.9% compared to GPT-5.
VisJudge increases correlation with human ratings by 60.5%.
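The two metrics behind these findings can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: the score lists are made-up toy values, the function names are hypothetical, and the choice of Pearson correlation is an assumption, since this summary does not say which correlation coefficient the authors report. It also assumes the percentages above are relative changes against a baseline model.

```python
from math import sqrt


def mean_absolute_error(model_scores, human_scores):
    """Average absolute gap between model scores and human ratings."""
    gaps = [abs(m - h) for m, h in zip(model_scores, human_scores)]
    return sum(gaps) / len(gaps)


def pearson_correlation(model_scores, human_scores):
    """Pearson correlation (assumed metric) between model and human scores."""
    n = len(human_scores)
    mean_m = sum(model_scores) / n
    mean_h = sum(human_scores) / n
    cov = sum((m - mean_m) * (h - mean_h)
              for m, h in zip(model_scores, human_scores))
    var_m = sum((m - mean_m) ** 2 for m in model_scores)
    var_h = sum((h - mean_h) ** 2 for h in human_scores)
    return cov / sqrt(var_m * var_h)


def relative_change_percent(new_value, baseline_value):
    """Relative change of new_value against baseline_value, in percent."""
    return (new_value - baseline_value) / baseline_value * 100


# Toy data only: five visualizations rated on a 1-5 scale.
human = [1, 2, 3, 4, 5]
baseline_model = [2, 2, 4, 3, 5]   # hypothetical general-purpose MLLM
tuned_model = [1, 2, 3, 5, 5]      # hypothetical fine-tuned judge

baseline_mae = mean_absolute_error(baseline_model, human)
tuned_mae = mean_absolute_error(tuned_model, human)
mae_change = relative_change_percent(tuned_mae, baseline_mae)

baseline_corr = pearson_correlation(baseline_model, human)
tuned_corr = pearson_correlation(tuned_model, human)
corr_change = relative_change_percent(tuned_corr, baseline_corr)
```

A negative `mae_change` means the tuned model's error is lower than the baseline's, and a positive `corr_change` means its scores track the human ratings more closely.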
Abstract
Visualization, a domain-specific yet widely used form of imagery, is an effective way to turn complex datasets into intuitive insights, and its value depends on whether data are faithfully represented, clearly communicated, and aesthetically designed. However, evaluating visualization quality is challenging: unlike natural images, it requires simultaneous judgment across data encoding accuracy, information expressiveness, and visual aesthetics. Although multimodal large language models (MLLMs) have shown promising performance in aesthetic assessment of natural images, no systematic benchmark exists for measuring their capabilities in evaluating visualizations. To address this, we propose VisJudge-Bench, the first comprehensive benchmark for evaluating MLLMs' performance in assessing visualization aesthetics and quality. It contains 3,090 expert-annotated samples from real-world…
Peer Reviews
Decision: ICLR 2026 Poster
1. **Timely problem:** The paper addresses a relevant and underexplored challenge: evaluating MLLMs on data visualizations, which differ significantly from general images and remain difficult for current models.
2. **Well-motivated framework:** The Fidelity–Expressiveness–Aesthetics framework is grounded in visualization theory and provides a clearer structure than using a single quality score.
3. **Solid experiments:** The benchmarking results across multiple MLLMs offer useful
1. Fundamental Assumption of a Singular "Ground Truth": The primary and most significant weakness of this work stems from a fundamental assumption that a singular, objective "ground truth" for visualization quality exists and can be established. This is particularly contentious for subjective dimensions like Aesthetics and even Expressiveness, where human preferences are known to be diverse, context-dependent, and often lack a single consensus. The paper's methodology—collecting scores from thre
1. The topic is interesting and the task is practical in real-world applications.
2. The authors constructed VISJUDGE-BENCH, a comprehensive benchmark based on the principles of Fidelity, Expressiveness, and Aesthetics to evaluate MLLMs’ visualization assessment capabilities.
3. They systematically evaluated representative MLLMs and found significant gaps compared with human expert standards.
1. Annotators are mainly presented with specific questions and asked to perform scoring tasks. Are they also asked to provide enough free-form critique? Doing so may benefit diversity and generalizability.
2. While the construction of the benchmark dataset is well illustrated, I am confused about its intended use. In Section 4, the authors claim: "To validate VisJudge-Bench as an effective training resource...". The authors' purpose is unclear to me. (1) Given that it's called 'VisJudge-Bench' and referr
Originality: It proposes VISJUDGE-BENCH, the first MLLM visualization evaluation benchmark covering the three dimensions of "Data Fidelity-Expressiveness-Aesthetics", filling the gap where existing benchmarks only focus on a single dimension. Through GRPO reinforcement learning and LoRA fine-tuning, it achieves significant alignment between MLLMs and human experts in visualization evaluation for the first time.
Quality: The benchmark construction undergoes rigorous multi-stage screening and mult
The paper uses only 3 annotators, which risks concentrated subjective bias, and eliminating abnormal scores may leave too little valid data. The 7 MLLMs tested in the experiments do not support a graded comparison across models of different parameter scales, and the only open-source model included is Qwen2.5-VL-7B, making it difficult to verify the adaptability of VISJUDGE's fine-tuning strategy and the universality of the benchmark in the open-source ecosystem.
