PSF-Med: Measuring and Explaining Paraphrase Sensitivity in Medical Vision Language Models
Binesh Sadanandan, Vahid Behzadan

TL;DR
This paper introduces PSF-Med, a benchmark of 26,850 chest X-ray questions and 92,856 validated paraphrases for measuring paraphrase sensitivity in medical vision language models. It shows that some models rely on language priors rather than the image, and it investigates the mechanisms behind answer flips.
Contribution
The study presents PSF-Med, a large-scale benchmark with validated paraphrases, and analyzes model mechanisms, showing how specific features influence answer stability and proposing methods to improve robustness.
Findings
Flip rates range from 3% to 37% across the nine VLMs evaluated.
Removing a sparse feature reduces flip rates by 31%.
A low flip rate does not imply visual grounding: some models stay consistent even when the image is removed, suggesting they rely on language priors.
Abstract
Medical Vision Language Models (VLMs) can change their answers when clinicians rephrase the same question, a failure mode that threatens deployment safety. We introduce PSF-Med, a benchmark of 26,850 chest X-ray questions paired with 92,856 meaning-preserving paraphrases across MIMIC-CXR, PadChest, and VinDr-CXR, spanning clinical populations in the US, Spain, and Vietnam. Every paraphrase is validated by an LLM judge using a bidirectional clinical entailment rubric, with 91.6% cross-family agreement. Across nine VLMs, including general-purpose models, we find flip rates from 3% to 37%. However, low flip rate does not imply visual grounding: text-only baselines show that some models stay consistent even when the image is removed, suggesting they rely on language priors. To study mechanisms in one model, we apply GemmaScope 2 Sparse Autoencoders (SAEs) to MedGemma 4B and analyze…
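As an illustration of the headline metric, the sketch below computes a flip rate as the fraction of (original, paraphrase) answer pairs whose answers differ after light normalization. This definition, the `normalize` and `flip_rate` helpers, and the toy answers are all assumptions made for illustration; the paper's exact scoring procedure is not given in this excerpt. Running the same computation on text-only outputs mirrors the abstract's point that consistency alone does not show the model is using the image.

```python
def normalize(answer: str) -> str:
    """Lowercase and trim an answer so trivial formatting differences are not counted as flips."""
    return answer.strip().lower().rstrip(".")


def flip_rate(pairs):
    """Fraction of (original_answer, paraphrase_answer) pairs whose answers differ.

    Assumed definition for illustration only; not taken from the paper.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (original, paraphrase) answer pair")
    flips = sum(normalize(a) != normalize(b) for a, b in pairs)
    return flips / len(pairs)


# Hypothetical model outputs: answer to the original question vs. answer to a paraphrase.
with_image = [("Yes", "yes"), ("No", "Yes"), ("no.", "No"), ("Yes", "Yes")]
# Same questions with the image removed (text-only baseline).
text_only = [("Yes", "Yes"), ("Yes", "yes"), ("Yes", "Yes"), ("Yes", "Yes")]

print(f"flip rate with image: {flip_rate(with_image):.2f}")
print(f"flip rate text-only:  {flip_rate(text_only):.2f}")
```

In this toy data the text-only run never flips, yet it answers without seeing any image, which is the pattern the abstract attributes to language priors.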
