Exposing and Mitigating Calibration Biases and Demographic Unfairness in MLLM Few-Shot In-Context Learning for Medical Image Classification

Xing Shen; Justin Szeto; Mingyang Li; Hengguan Huang; Tal Arbel

arXiv:2506.23298 · eess.IV · July 22, 2025

TL;DR

This paper investigates calibration biases and demographic unfairness in multimodal large language models for medical image classification and introduces CALIN, a calibration method that improves fairness and accuracy across diverse demographic groups.

Contribution

The study is the first to analyze calibration biases and demographic fairness in MLLMs for medical imaging and proposes CALIN, a novel inference-time calibration technique to mitigate these biases.

Findings

1. CALIN improves fairness in confidence scores across demographic groups.
2. CALIN enhances overall prediction accuracy with minimal fairness-utility trade-off.
3. Experimental results on three datasets demonstrate CALIN's effectiveness in real-world scenarios.

Abstract

Multimodal large language models (MLLMs) have enormous potential to perform few-shot in-context learning in the context of medical image analysis. However, safe deployment of these models into real-world clinical practice requires an in-depth analysis of the accuracies of their predictions, and their associated calibration errors, particularly across different demographic subgroups. In this work, we present the first investigation into the calibration biases and demographic unfairness of MLLMs' predictions and confidence scores in few-shot in-context learning for medical image classification. We introduce CALIN, an inference-time calibration method designed to mitigate the associated biases. Specifically, CALIN estimates the amount of calibration needed, represented by calibration matrices, using a bi-level procedure: progressing from the population level to the subgroup level prior to…
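
To make "calibration errors across different demographic subgroups" concrete, the sketch below computes the widely used expected calibration error (ECE) per subgroup and the gap between the best and worst calibrated groups. It is an illustration only, not CALIN and not necessarily the metric the paper reports: the function names `expected_calibration_error` and `subgroup_ece_gap`, the choice of 10 equal-width bins, and the toy data are all assumptions made for this example.

```python
from collections import defaultdict


def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - mean confidence| per bin."""
    if not confidences:
        raise ValueError("need at least one prediction")
    bins = defaultdict(list)
    for conf, hit in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for items in bins.values():
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, h in items if h) / len(items)
        ece += (len(items) / total) * abs(accuracy - avg_conf)
    return ece


def subgroup_ece_gap(confidences, correct, groups, n_bins=10):
    """Return ({group: ECE}, max ECE - min ECE) over the subgroups."""
    per_group = defaultdict(lambda: ([], []))
    for conf, hit, group in zip(confidences, correct, groups):
        per_group[group][0].append(conf)
        per_group[group][1].append(hit)
    eces = {
        g: expected_calibration_error(c, h, n_bins)
        for g, (c, h) in per_group.items()
    }
    return eces, max(eces.values()) - min(eces.values())


# Toy data (made up): group "B" is more overconfident than group "A".
confidences = [0.9, 0.9, 0.9, 0.9, 0.6, 0.6, 0.6, 0.6]
correct = [True, True, True, False, True, False, False, False]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

eces, gap = subgroup_ece_gap(confidences, correct, groups)
print(eces, gap)
```

A large gap means the model's confidence scores are less trustworthy for one subgroup than for another, even when overall accuracy looks acceptable. This is the kind of disparity an inference-time calibration method would aim to shrink.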

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · COVID-19 diagnosis using AI · Artificial Intelligence in Healthcare and Education