Beyond Pixel Agreement: Large Language Models as Clinical Guardrails for Reliable Medical Image Segmentation
Jiaxi Sheng, Leyi Yu, Haoyue Li, Yifan Gao, and Xin Gao

TL;DR
This paper presents HCR, a novel framework using Large Language Models as clinical guardrails to reliably assess medical image segmentation quality, surpassing traditional pixel-based metrics and providing interpretable reasoning.
Contribution
Introduces HCR, a multistage prompting strategy that enables LLMs to evaluate segmentation quality in medical imaging with high accuracy and interpretability, without task-specific training.
Findings
HCR achieved 78.12% classification accuracy using models such as Gemini 2.5 Flash, on a dataset spanning six medical imaging tasks.
HCR outperformed ResNet50 in certain tasks.
Provides interpretable, step-by-step reasoning for assessments.
Abstract
Evaluating AI-generated medical image segmentations for clinical acceptability poses a significant challenge, as traditional pixel-agreement metrics often fail to capture true diagnostic utility. This paper introduces Hierarchical Clinical Reasoner (HCR), a novel framework that leverages Large Language Models (LLMs) as clinical guardrails for reliable, zero-shot quality assessment. HCR employs a structured, multistage prompting strategy that guides LLMs through a detailed reasoning process, encompassing knowledge recall, visual feature analysis, anatomical inference, and clinical synthesis, to evaluate segmentations. We evaluated HCR on a diverse dataset across six medical imaging tasks. Our results show that HCR, utilizing models like Gemini 2.5 Flash, achieved a classification accuracy of 78.12%, performing comparably to, and in instances exceeding, dedicated vision models such as…
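The four reasoning stages named in the abstract can be illustrated with a minimal sketch of staged prompting. This is not the authors' implementation: the function name `run_staged_assessment`, the instruction wording, the verdict labels, and the `stub_llm` stand-in are all assumptions made for illustration, and the sketch passes only text, whereas a real pipeline would also send the image and segmentation overlay to a multimodal model.

```python
from typing import Callable, Dict, List, Tuple

# Stage names follow the abstract; the instruction wording is hypothetical.
STAGES: List[Tuple[str, str]] = [
    ("knowledge_recall",
     "Recall the expected anatomy and typical appearance of the target structure."),
    ("visual_feature_analysis",
     "Describe the segmentation's visible features: shape, boundary, coverage."),
    ("anatomical_inference",
     "Judge whether the segmented region is anatomically plausible."),
    ("clinical_synthesis",
     "Combine the earlier notes and give a final verdict: ACCEPTABLE or UNACCEPTABLE."),
]


def run_staged_assessment(task: str, llm: Callable[[str], str]) -> Dict[str, object]:
    """Call `llm` once per stage, feeding each stage the notes from earlier stages."""
    notes: List[str] = []
    transcript: Dict[str, str] = {}
    for name, instruction in STAGES:
        context = "\n".join(notes) if notes else "(none)"
        prompt = (
            f"Task: {task}\n"
            f"Stage: {name}\n"
            f"Previous notes:\n{context}\n"
            f"Instruction: {instruction}"
        )
        reply = llm(prompt)
        transcript[name] = reply
        notes.append(f"[{name}] {reply}")

    # Check UNACCEPTABLE first, because it contains ACCEPTABLE as a substring.
    final = transcript["clinical_synthesis"].upper()
    if "UNACCEPTABLE" in final:
        verdict = "unacceptable"
    elif "ACCEPTABLE" in final:
        verdict = "acceptable"
    else:
        verdict = "undetermined"
    return {"transcript": transcript, "verdict": verdict}


def stub_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call, used only to run the sketch."""
    if "Stage: clinical_synthesis" in prompt:
        return "Final verdict: ACCEPTABLE"
    return "placeholder notes for this stage"


if __name__ == "__main__":
    result = run_staged_assessment("liver CT segmentation", stub_llm)
    print(result["verdict"])
```

Keeping the per-stage replies in `transcript` is what would make the assessment inspectable step by step, in line with the interpretable reasoning the paper reports. Swapping `stub_llm` for a real model client is left open, since the source does not describe the API used.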
Taxonomy
Topics: Radiomics and Machine Learning in Medical Imaging · AI in cancer detection
