The Scaffold Effect: How Prompt Framing Drives Apparent Multimodal Gains in Clinical VLM Evaluation
Doan Nam Long Vu, Simone Balloccu

TL;DR
This study shows that prompt framing strongly influences the apparent multimodal performance of clinical vision-language models. The gains often stem from superficial cues rather than genuine evidence integration, which raises concerns for clinical deployment.
Contribution
The paper uncovers the 'scaffold effect', showing how prompt wording can create false impressions of multimodal understanding without real evidence processing.
Findings
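Smaller VLMs gain up to 58% F1 when neuroimaging context is introduced, even though the MRI data carries no reliable individual-level diagnostic signal.
Merely mentioning MRI availability in the prompt accounts for 70-80% of this shift, whether or not imaging data is actually present.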
Expert evaluation shows fabricated neuroimaging justifications across conditions.
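To make the 70-80% figure concrete, here is a minimal sketch of one way such a share could be computed. It is not the paper's contrastive confidence analysis, whose exact definition is not given in this summary. It assumes three hypothetical prompt conditions (no MRI mention, MRI mentioned without data, MRI mentioned with data) and compares mean model confidence across them. The function name `mention_share` and all numbers are invented for illustration.

```python
from statistics import mean


def mention_share(baseline, mention_only, mention_plus_data):
    """Fraction of the total confidence shift that already appears
    when MRI is only mentioned in the prompt (assumed definition)."""
    total_shift = mean(mention_plus_data) - mean(baseline)
    if total_shift == 0:
        raise ValueError("no shift between baseline and full condition")
    mention_shift = mean(mention_only) - mean(baseline)
    return mention_shift / total_shift


# Hypothetical per-sample confidences, made up for illustration only;
# they are not results from the paper.
baseline = [0.50, 0.52, 0.48, 0.50]           # prompt without any MRI mention
mention_only = [0.65, 0.66, 0.64, 0.65]       # MRI mentioned, no imaging data
mention_plus_data = [0.70, 0.72, 0.68, 0.70]  # MRI mentioned, imaging data attached

share = mention_share(baseline, mention_only, mention_plus_data)
print(f"share of shift attributable to the mention alone: {share:.0%}")
```

Under this assumed definition, a share near 1 would mean the imaging data itself adds almost nothing beyond the wording of the prompt.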
Abstract
Trustworthy clinical AI requires that performance gains reflect genuine evidence integration rather than surface-level artifacts. We evaluate 12 open-weight vision-language models (VLMs) on binary classification across two clinical neuroimaging cohorts, FOR2107 (affective disorders) and OASIS-3 (cognitive decline). Both datasets come with structural MRI data that carries no reliable individual-level diagnostic signal. Under these conditions, smaller VLMs exhibit gains of up to 58% F1 upon introduction of neuroimaging context, with distilled models becoming competitive with counterparts an order of magnitude larger. A contrastive confidence analysis reveals that merely mentioning MRI availability in the task prompt accounts for 70-80% of this shift, independent of whether imaging data is present, a domain-specific instance of modality collapse we term the…
