Multi-modal, Multi-task, Multi-criteria Automatic Evaluation with Vision Language Models

Masanari Ohi; Masahiro Kaneko; Naoaki Okazaki; Nakamasa Inoue

arXiv:2412.14613 · cs.CL · March 10, 2026

TL;DR

HarmonicEval is a new multi-criteria, reference-free evaluation metric for vision-language models that better aligns with human judgments across multiple tasks, supported by a large human-annotated benchmark.

Contribution

The paper introduces HarmonicEval, a novel multi-criteria evaluation metric, and the MMHE benchmark, enabling more accurate and adaptable assessment of vision-language models across diverse tasks.

Findings

01 HarmonicEval outperforms traditional metrics in correlating with human judgments.

02 The MMHE benchmark includes 18,000 expert annotations across four tasks.

03 HarmonicEval provides criterion-wise scores, enhancing interpretability.
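
To make the first finding concrete: agreement between an automatic metric and human judgments is commonly measured with a rank correlation. The sketch below computes Kendall's tau-a in plain Python. The choice of coefficient, the function name `kendall_tau_a`, and the toy scores are all assumptions for illustration; this summary does not state which correlation measure the paper reports, and none of the numbers come from the paper.

```python
from itertools import combinations


def kendall_tau_a(metric_scores, human_scores):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(metric_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")
    n = len(metric_scores)
    if n < 2:
        raise ValueError("need at least two items")

    concordant = 0
    discordant = 0
    for i, j in combinations(range(n), 2):
        # Positive product: both rankings order the pair the same way.
        product = (metric_scores[i] - metric_scores[j]) * (
            human_scores[i] - human_scores[j]
        )
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical toy data: five generated texts rated by humans and two metrics.
human = [1, 2, 3, 4, 5]
metric_a = [0.1, 0.4, 0.35, 0.8, 0.9]  # mostly agrees with the human ranking
metric_b = [0.9, 0.2, 0.5, 0.3, 0.6]   # agrees on only half of the pairs

tau_a = kendall_tau_a(metric_a, human)
tau_b = kendall_tau_a(metric_b, human)
print(tau_a, tau_b)
```

Under this measure, the metric with the higher coefficient ranks texts more like the human annotators do.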

Abstract

Vision-language models (VLMs) have shown impressive abilities across a range of multi-modal tasks. However, existing metrics for evaluating the quality of text generated by VLMs typically focus on an overall evaluation for a specific task, such as image captioning. While the overall evaluation is essential for any task, the criteria prioritized can differ depending on the task, making it challenging for current metrics to adapt to multi-task scenarios. To address this limitation, we propose HarmonicEval, a reference-free comprehensive evaluation metric that aggregates criterion-wise scores to produce the overall score in a bottom-up manner. Furthermore, to assess the generalizability of automatic evaluation metrics in multi-task scenarios, we construct the Multi-task Multi-criteria Human Evaluation (MMHE) benchmark, which comprises 18,000 expert human judgments across four multi-modal…
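
The bottom-up idea in the abstract, scoring each criterion separately and then combining the results into one overall score, can be sketched as a weighted average. This is only an illustration of the general pattern: the criterion names, the 1 to 5 score scale, the weights, and the function `aggregate_overall` are hypothetical, and the excerpt above does not describe the aggregation rule HarmonicEval actually uses.

```python
def aggregate_overall(criterion_scores, weights=None):
    """Combine criterion-wise scores into one overall score.

    Uses a weighted average; with no weights given, every criterion
    counts equally. This is an assumed aggregation rule, not the paper's.
    """
    if not criterion_scores:
        raise ValueError("at least one criterion score is required")
    if weights is None:
        weights = {name: 1.0 for name in criterion_scores}
    if set(weights) != set(criterion_scores):
        raise ValueError("weights and scores must cover the same criteria")

    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")

    weighted_sum = sum(weights[name] * criterion_scores[name] for name in criterion_scores)
    return weighted_sum / total_weight


# Hypothetical criterion-wise scores for one generated text (1 to 5 scale).
scores = {"correctness": 2.0, "fluency": 5.0, "conciseness": 5.0}

# Equal weighting across criteria.
uniform_overall = aggregate_overall(scores)

# A task that prioritizes correctness can weight it more heavily.
task_weights = {"correctness": 2.0, "fluency": 1.0, "conciseness": 1.0}
weighted_overall = aggregate_overall(scores, task_weights)

print(uniform_overall, weighted_overall)
```

Keeping the per-criterion scores alongside the overall score is what allows the evaluation to be inspected criterion by criterion, and changing the weights is one simple way to reflect that different tasks prioritize different criteria.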

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Natural Language Processing Techniques
