Judging the Judges: Can Large Vision-Language Models Fairly Evaluate Chart Comprehension and Reasoning?

Md Tahmid Rahman Laskar; Mohammed Saidul Islam; Ridwan Mahbub; Ahmed Masry; Mizanur Rahman; Amran Bhuiyan; Mir Tafseer Nayeem; Shafiq Joty; Enamul Hoque; Jimmy Huang

arXiv:2505.08468 · cs.CL · July 8, 2025


TL;DR

This paper evaluates 13 open-source large vision-language models as cost-effective judges for chart comprehension and reasoning tasks, revealing variability in performance and biases like positional and length preferences.

Contribution

It introduces a standardized protocol for assessing open-source LVLMs as automatic evaluators for chart understanding, highlighting their potential and limitations.

Findings

01 Some LVLM judges achieve up to 80% agreement with GPT-4.

02 Performance varies significantly across models.

03 Biases such as positional preference and length bias are observed.
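The findings above rest on two kinds of measurement: how often a judge's verdict matches a reference judge, and how much its verdicts depend on presentation order or response length. The sketch below shows one plausible way to compute such numbers for pairwise verdicts. It is an illustration under stated assumptions, not the paper's protocol: the function names, the `"A"`/`"B"`/`"tie"` verdict labels, and the exact metric definitions are all hypothetical.

```python
from typing import Sequence


def agreement_rate(judge: Sequence[str], reference: Sequence[str]) -> float:
    """Fraction of items where the judge's verdict equals the reference verdict."""
    if len(judge) != len(reference) or not judge:
        raise ValueError("verdict lists must be non-empty and of equal length")
    return sum(j == r for j, r in zip(judge, reference)) / len(judge)


def position_flip_rate(original: Sequence[str], swapped: Sequence[str]) -> float:
    """Fraction of pairs whose verdict changes when the two responses swap places.

    Assumption: verdicts from the swapped run have already been mapped back to
    the original response labels, so a position-insensitive judge yields
    identical lists and a flip rate of 0.
    """
    if len(original) != len(swapped) or not original:
        raise ValueError("verdict lists must be non-empty and of equal length")
    return sum(o != s for o, s in zip(original, swapped)) / len(original)


def longer_preferred_rate(
    verdicts: Sequence[str], len_a: Sequence[int], len_b: Sequence[int]
) -> float:
    """Among decided pairs with unequal lengths, how often the longer response wins."""
    decided = [
        (v, a, b)
        for v, a, b in zip(verdicts, len_a, len_b)
        if v in ("A", "B") and a != b  # skip ties and equal-length pairs
    ]
    if not decided:
        raise ValueError("no decided pairs with unequal response lengths")
    longer_wins = sum((v == "A") == (a > b) for v, a, b in decided)
    return longer_wins / len(decided)


if __name__ == "__main__":
    # Toy verdicts, made up purely for illustration.
    judge = ["A", "B", "A", "tie", "B"]
    reference = ["A", "B", "B", "tie", "B"]
    print(agreement_rate(judge, reference))
    print(position_flip_rate(["A", "B", "A", "A"], ["A", "B", "B", "A"]))
    print(longer_preferred_rate(["A", "B", "A", "tie"], [10, 5, 3, 7], [4, 9, 8, 7]))
```

Under these definitions, a longer-preferred rate well above 0.5 would only hint at length bias, since longer responses may also be genuinely better. Separating the two would need a reference verdict to compare against.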

Abstract

Charts are ubiquitous as they help people understand and reason with data. Recently, various downstream tasks, such as chart question answering, chart2text, and fact-checking, have emerged. Large Vision-Language Models (LVLMs) show promise in tackling these tasks, but their evaluation is costly and time-consuming, limiting real-world deployment. While using LVLMs as judges to assess the chart comprehension capabilities of other LVLMs could streamline evaluation processes, challenges like proprietary datasets, restricted access to powerful models, and evaluation costs hinder their adoption in industrial settings. To this end, we present a comprehensive evaluation of 13 open-source LVLMs as judges for diverse chart comprehension and reasoning tasks. We design both pairwise and pointwise evaluation tasks covering criteria like factual correctness, informativeness, and relevancy.…


Code & Models

Repositories

tahmedge/chart_lvlm_judge (official)

Videos

Judging the Judges: Can Large Vision-Language Models Fairly Evaluate Chart Comprehension and Reasoning? · Underline

Taxonomy

Topics: Natural Language Processing Techniques · Artificial Intelligence in Law

Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Dense Connections · Dropout · Layer Normalization · Byte Pair Encoding · Softmax · Absolute Position Encodings · Residual Connection