Auditing Multimodal LLM Raters: Central Tendency Bias in Clinical Ordinal Scoring

Jiaqing Zhang; Sandeep Elluri; Bhanu Cherukuvada; Yonah Joffe; Jessica Sena; Miguel Contreras; Scott Siegel; Subhash Nerella; Catherine Price; Parisa Rashidi

arXiv:2605.16386·cs.CV·May 19, 2026

TL;DR

This study evaluates multimodal LLMs as raters for Clock Drawing Test scoring and finds a central tendency bias: predictions are compressed toward the middle of the ordinal scale, which distorts scores at the critical extremes.

Contribution

The paper benchmarks three frontier multimodal LLM families against supervised deep learning models for Clock Drawing Test scoring and uncovers a systematic bias toward central scores, a calibration problem that surfaces in per-score analysis.

Findings

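01. Fully fine-tuned Vision Transformers achieve the best calibration (MAE 0.52, within-1 accuracy 91%).

02. Zero-shot LLMs remain competitive on tolerance-based agreement (GPT-5: MAE 0.67, within-1 accuracy 92%) despite higher absolute error.

03. All three LLM families show a central tendency bias: predictions are compressed toward the middle of the scale, with over-prediction at the low end.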

Abstract

Multimodal large language models (LLMs) are increasingly explored as automated evaluators in clinical settings, yet their scoring behavior on ordinal clinical scales remains poorly understood. We benchmark three frontier LLM families against supervised deep learning models for scoring Clock Drawing Test (CDT) images on two public datasets using the Shulman rubric. While fully fine-tuned Vision Transformers achieve the best calibration (MAE 0.52, within-1 accuracy 91%), zero-shot LLMs remain competitive on tolerance-based agreement (GPT-5 MAE 0.67, within-1 accuracy 92%) despite higher absolute error. However, per-score analysis reveals that all three LLM families exhibit a pronounced central tendency effect (systematic endpoint compression): predictions are systematically compressed toward the middle of the scale, with over-prediction at the low end (score 0 to 1) and under-prediction…
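The metrics named in the abstract can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names are hypothetical, the 0 to 5 score range is an assumption about the Shulman variant used, and the ratings are made-up toy values chosen only to show how a rater can reach perfect within-1 accuracy while still compressing both ends of the scale.

```python
from collections import defaultdict


def mae(y_true, y_pred):
    """Mean absolute error between reference and predicted ordinal scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def within_one_accuracy(y_true, y_pred):
    """Fraction of predictions that land within one point of the reference."""
    return sum(abs(t - p) <= 1 for t, p in zip(y_true, y_pred)) / len(y_true)


def per_score_bias(y_true, y_pred):
    """Mean signed error (predicted minus reference), grouped by reference score.

    Positive values at the low end together with negative values at the
    high end are the signature of central tendency (endpoint compression).
    """
    errors = defaultdict(list)
    for t, p in zip(y_true, y_pred):
        errors[t].append(p - t)
    return {score: sum(e) / len(e) for score, e in sorted(errors.items())}


# Toy ratings on an assumed 0-5 scale (illustrative only, not data from the paper).
y_true = [0, 0, 1, 2, 3, 4, 5, 5]
y_pred = [1, 1, 1, 2, 3, 4, 4, 4]

print("MAE:", mae(y_true, y_pred))
print("Within-1 accuracy:", within_one_accuracy(y_true, y_pred))
print("Per-score bias:", per_score_bias(y_true, y_pred))
```

In this toy case every prediction is within one point of the reference, so the tolerance-based metric looks perfect, yet the per-score breakdown shows the lowest score pushed up and the highest score pulled down. That is the reason a per-score view is useful alongside aggregate agreement metrics.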
