Rethinking Ground Truth: A Case Study on Human Label Variation in MLLM Benchmarking

Tomas Ruiz; Tanalp Agustoslu; Carsten Schwemmer

arXiv:2603.19744 · cs.CL · March 23, 2026 · IEEE Big Data

TL;DR

This paper investigates human label variation in MLLM benchmarking, proposing an evaluation protocol that considers both agreement and disagreement among annotators to provide more realistic model assessments.

Contribution

It introduces a new benchmarking protocol for MLLMs that explicitly accounts for human label variation, highlighting the limitations of consensus-based evaluations.

Findings

01

Larger models tend to perform best on high-agreement subsets.

02

Larger models often underperform medium-sized models when human disagreement is high.

03

Benchmarking solely on consensus labels can overestimate model capabilities.

Abstract

Human Label Variation (HLV), i.e. systematic differences among annotators' judgments, remains underexplored in benchmarks despite rapid progress in large language model (LLM) development. We address this gap by introducing an evaluation protocol for multimodal large language model (MLLM) benchmarking that explicitly accounts for two conditions: (1) human label agreement and (2) disagreement. We apply this protocol to two state-of-the-art MLLM families (Gemma 3, Qwen 2.5 VL) using non-aggregated human annotations from a social media content classification dataset. Across tasks, we find that larger models tend to perform best on high-agreement subsets, yet often underperform medium-sized models when human disagreement is high, indicating that parameter count alone does not determine sensitivity to ambiguity and subjectivity. These results show that benchmarks based solely on consensus…
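
The two conditions named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the example labels, the rule that "agreement" means all annotators chose the same label, and the soft-accuracy score (the share of annotators whose label matches the model's prediction) are all assumptions made here for illustration.

```python
def split_by_agreement(annotations):
    """Split items into agreement and disagreement subsets.

    Assumption: an item counts as "agreement" only when every
    annotator assigned the same label (the paper may use another rule).
    """
    agreement, disagreement = {}, {}
    for item_id, labels in annotations.items():
        target = agreement if len(set(labels)) == 1 else disagreement
        target[item_id] = labels
    return agreement, disagreement


def soft_accuracy(predictions, subset):
    """Mean share of annotators whose label matches the model prediction.

    On a unanimous subset this reduces to ordinary accuracy.
    Hypothetical metric, chosen only to keep the sketch short.
    """
    if not subset:
        return float("nan")
    scores = [
        labels.count(predictions[item_id]) / len(labels)
        for item_id, labels in subset.items()
    ]
    return sum(scores) / len(scores)


# Hypothetical non-aggregated annotations: three annotators per post.
annotations = {
    "post_1": ["political", "political", "political"],
    "post_2": ["other", "other", "other"],
    "post_3": ["political", "other", "other"],
    "post_4": ["other", "political", "political"],
}
# Hypothetical model predictions for the same posts.
predictions = {
    "post_1": "political",
    "post_2": "political",
    "post_3": "other",
    "post_4": "other",
}

agreement, disagreement = split_by_agreement(annotations)
agreement_score = soft_accuracy(predictions, agreement)
disagreement_score = soft_accuracy(predictions, disagreement)
print(f"agreement subset: {agreement_score:.2f}")
print(f"disagreement subset: {disagreement_score:.2f}")
```

Reporting the two scores separately, rather than a single score against majority-vote labels, is the point of the split: a model can look strong on unanimous items while matching few annotators on contested ones.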

Taxonomy

Topics: Topic Modeling · Sentiment Analysis and Opinion Mining · Authorship Attribution and Profiling