Auditing Multimodal LLM Raters: Central Tendency Bias in Clinical Ordinal Scoring
Jiaqing Zhang, Sandeep Elluri, Bhanu Cherukuvada, Yonah Joffe, Jessica Sena, Miguel Contreras, Scott Siegel, Subhash Nerella, Catherine Price, Parisa Rashidi

TL;DR
This study evaluates multimodal LLMs as raters for Clock Drawing Test scoring and finds a central tendency bias: predictions are compressed toward the middle of the scale, so errors concentrate at the critical low and high ends.
Contribution
The paper benchmarks three frontier LLM families against supervised deep learning models on Clock Drawing Test scoring with the Shulman rubric. Its per-score analysis uncovers a systematic bias toward central scores, highlighting calibration issues.
Findings
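Fully fine-tuned Vision Transformers achieve the best calibration (MAE 0.52, within-1 accuracy 91%)
Zero-shot LLMs remain competitive on tolerance-based agreement despite higher absolute error (GPT-5: MAE 0.67, within-1 accuracy 92%)
All three LLM families show a central tendency bias, over-predicting low scores and under-predicting high scores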
Abstract
Multimodal large language models (LLMs) are increasingly explored as automated evaluators in clinical settings, yet their scoring behavior on ordinal clinical scales remains poorly understood. We benchmark three frontier LLM families against supervised deep learning models for scoring Clock Drawing Test (CDT) images on two public datasets using the Shulman rubric. While fully fine-tuned Vision Transformers achieve the best calibration (MAE 0.52, within-1 accuracy 91%), zero-shot LLMs remain competitive on tolerance-based agreement (GPT-5 MAE 0.67, within-1 accuracy 92%) despite higher absolute error. However, per-score analysis reveals that all three LLM families exhibit a pronounced central tendency effect (systematic endpoint compression): predictions are systematically compressed toward the middle of the scale, with over-prediction at the low end (score 0 to 1) and under-prediction…
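The metrics named in the abstract can be illustrated with a minimal sketch. It computes MAE, within-1 accuracy, and a per-score mean signed error, which is one simple way to surface endpoint compression. The function names, the 0 to 5 score range, and the toy labels are assumptions made for illustration; they are not the paper's code or data.

```python
from collections import defaultdict

def mae(y_true, y_pred):
    """Mean absolute error between reference and predicted scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def within_one_accuracy(y_true, y_pred):
    """Fraction of predictions at most one scale point from the reference."""
    return sum(abs(t - p) <= 1 for t, p in zip(y_true, y_pred)) / len(y_true)

def per_score_bias(y_true, y_pred):
    """Mean signed error (prediction minus reference) for each reference score.

    Positive values at low scores and negative values at high scores are
    the signature of compression toward the middle of the scale.
    """
    errors = defaultdict(list)
    for t, p in zip(y_true, y_pred):
        errors[t].append(p - t)
    return {score: sum(e) / len(e) for score, e in sorted(errors.items())}

# Hypothetical toy labels on an assumed 0-5 scale (not data from the paper).
y_true = [0, 0, 1, 2, 3, 3, 4, 5, 5, 5]
y_pred = [1, 1, 1, 2, 3, 3, 4, 4, 4, 5]

print("MAE:", mae(y_true, y_pred))
print("Within-1 accuracy:", within_one_accuracy(y_true, y_pred))
print("Per-score bias:", per_score_bias(y_true, y_pred))
```

In this toy case every prediction is within one point of the reference, so within-1 accuracy is perfect, yet the per-score bias is positive at score 0 and negative at score 5. That is how a rater can look competitive on tolerance-based agreement while still compressing the extremes.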
