DeVisE: Behavioral Testing of Medical Large Language Models

Camila Zurdo Tagliabue; Heloisa Oss Boll; Aykut Erdem; Erkut Erdem; Iacer Calixto

arXiv:2506.15339 · cs.CL · February 27, 2026

TL;DR

DeVisE is a behavioral testing framework that assesses medical LLMs' understanding by analyzing their responses to controlled counterfactual changes in clinical data, revealing differences in reasoning and sensitivity.

Contribution

We introduce DeVisE, a novel framework for probing fine-grained clinical reasoning in medical LLMs using counterfactuals, and analyze model behavior on ICU data.

Findings

01. Models show varied sensitivity to demographic and vital sign changes.

02. Standard metrics often miss clinically relevant behavioral differences.

03. Medical LLMs differ significantly in their response consistency.

Abstract

Large language models (LLMs) are increasingly applied in clinical decision support, yet current evaluations rarely reveal whether their outputs reflect genuine medical reasoning or superficial correlations. We introduce DeVisE (Demographics and Vital signs Evaluation), a behavioral testing framework that probes fine-grained clinical understanding through controlled counterfactuals. Using intensive care unit (ICU) discharge notes from MIMIC-IV, we construct both raw (real-world) and template-based (synthetic) variants with single-variable perturbations in demographic (age, gender, ethnicity) and vital sign attributes. We evaluate eight LLMs, spanning general-purpose and medical variants, in a zero-shot setting. Model behavior is analyzed through (1) input-level sensitivity, capturing how counterfactuals alter perplexity, and (2) downstream reasoning, measuring their effect on predicted…
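The idea of single-variable perturbations and input-level sensitivity can be illustrated with a minimal sketch. This is not the authors' implementation: the template text, the attribute values, and the helper names (`counterfactuals`, `render`, `perplexity_shifts`) are hypothetical, and a toy add-one-smoothed unigram model stands in for an LLM so that the example runs without any external dependency. In practice the perplexity would come from the language model under test.

```python
import math
from collections import Counter

# Hypothetical template; the paper's actual templates are not shown in this page.
TEMPLATE = ("Patient is a {age}-year-old {ethnicity} {gender} "
            "with heart rate {heart_rate} bpm.")

# Hypothetical base record used only for illustration.
BASE = {"age": 64, "gender": "female", "ethnicity": "white", "heart_rate": 82}


def counterfactuals(base, attribute, values):
    """Yield copies of `base` that differ from it only in `attribute`."""
    for value in values:
        if value != base[attribute]:
            yield {**base, attribute: value}


def render(record):
    """Fill the template with one record."""
    return TEMPLATE.format(**record)


class UnigramLM:
    """Toy add-one-smoothed unigram model (a stand-in for a real LLM)."""

    def __init__(self, corpus):
        tokens = corpus.lower().split()
        self.counts = Counter(tokens)
        self.total = len(tokens)
        self.vocab = len(self.counts) + 1  # one extra slot for unseen tokens

    def perplexity(self, text):
        tokens = text.lower().split()
        log_prob = sum(
            math.log((self.counts[t] + 1) / (self.total + self.vocab))
            for t in tokens
        )
        return math.exp(-log_prob / len(tokens))


def perplexity_shifts(lm, base, attribute, values):
    """Map each counterfactual value to its perplexity change vs. the base note."""
    base_ppl = lm.perplexity(render(base))
    return {
        cf[attribute]: lm.perplexity(render(cf)) - base_ppl
        for cf in counterfactuals(base, attribute, values)
    }


if __name__ == "__main__":
    lm = UnigramLM(render(BASE))  # toy "training" corpus: the base note itself
    print(perplexity_shifts(lm, BASE, "gender", ["female", "male"]))
    print(perplexity_shifts(lm, BASE, "heart_rate", [82, 140]))
```

Because each counterfactual changes exactly one attribute, any shift in perplexity can be attributed to that attribute alone, which is the property the single-variable design relies on.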

Taxonomy

Topics: Topic Modeling