Rethinking Ground Truth: A Case Study on Human Label Variation in MLLM Benchmarking
Tomas Ruiz, Tanalp Agustoslu, Carsten Schwemmer

TL;DR
This paper investigates human label variation in MLLM benchmarking, proposing an evaluation protocol that considers both agreement and disagreement among annotators to provide more realistic model assessments.
Contribution
It introduces a new benchmarking protocol for MLLMs that explicitly accounts for human label variation, highlighting the limitations of consensus-based evaluations.
Findings
Larger models tend to perform best on high-agreement subsets.
Larger models often underperform medium-sized models when human disagreement is high.
Benchmarking solely on consensus labels can overestimate model capabilities.
Abstract
Human Label Variation (HLV), i.e. systematic differences among annotators' judgments, remains underexplored in benchmarks despite rapid progress in large language model (LLM) development. We address this gap by introducing an evaluation protocol for multimodal large language model (MLLM) benchmarking that explicitly accounts for two conditions: (1) human label agreement and (2) disagreement. We apply this protocol to two state-of-the-art MLLM families (Gemma 3, Qwen 2.5 VL) using non-aggregated human annotations from a social media content classification dataset. Across tasks, we find that larger models tend to perform best on high-agreement subsets, yet often underperform medium-sized models when human disagreement is high, indicating that parameter count alone does not determine sensitivity to ambiguity and subjectivity. These results show that benchmarks based solely on consensus…
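The two-condition idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the unanimity rule for "agreement", the function names `split_by_agreement` and `annotator_match_rate`, the scoring rule, and the toy data are all assumptions made for illustration. The sketch splits items by whether annotators agree, then scores a model separately on each subset.

```python
def split_by_agreement(annotations):
    """Split items into unanimous and non-unanimous subsets.

    annotations: dict mapping item id -> list of labels, one per annotator.
    Assumption: "agreement" means every annotator chose the same label.
    """
    agree, disagree = {}, {}
    for item_id, labels in annotations.items():
        target = agree if len(set(labels)) == 1 else disagree
        target[item_id] = labels
    return agree, disagree


def annotator_match_rate(subset, predictions):
    """Mean fraction of annotators whose label equals the model's prediction.

    On a unanimous subset this reduces to plain accuracy.
    This scoring rule is an assumption, not the metric used in the paper.
    """
    if not subset:
        return float("nan")
    per_item = [
        sum(label == predictions[item_id] for label in labels) / len(labels)
        for item_id, labels in subset.items()
    ]
    return sum(per_item) / len(per_item)


# Hypothetical toy data: four items, three annotators each.
annotations = {
    "a": ["yes", "yes", "yes"],
    "b": ["yes", "no", "no"],
    "c": ["no", "no", "no"],
    "d": ["yes", "no", "yes"],
}
predictions = {"a": "yes", "b": "yes", "c": "yes", "d": "yes"}

agree, disagree = split_by_agreement(annotations)
score_agree = annotator_match_rate(agree, predictions)
score_disagree = annotator_match_rate(disagree, predictions)
```

Reporting the two scores side by side, rather than a single number against aggregated labels, is what lets a benchmark show whether a model's performance changes when annotators disagree.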
Taxonomy
Topics: Topic Modeling · Sentiment Analysis and Opinion Mining · Authorship Attribution and Profiling
