Measuring What VLMs Don't Say: Validation Metrics Hide Clinical Terminology Erasure in Radiology Report Generation
Aditya Parikh, Aasa Feragen, Sneha Das, Stella Frank

TL;DR
This paper shows that current validation metrics for Vision-Language Models in radiology can mask both the omission of clinical terminology and demographic bias, and it proposes lexical diversity and association measures for better evaluation.
Contribution
It introduces Clinical Association Displacement (CAD) and Weighted Association Erasure (WAE), novel metrics to assess clinical fidelity and demographic fairness in radiology report generation.
Findings
Deterministic decoding leads to clinical terminology erasure.
Stochastic sampling increases diversity but may introduce bias.
Current metrics can be gamed: models can achieve high token-overlap scores while remaining clinically uninformative.
Abstract
Reliable deployment of Vision-Language Models (VLMs) in radiology requires validation metrics that go beyond surface-level text similarity to ensure clinical fidelity and demographic fairness. This paper investigates a critical blind spot in current model evaluation: the use of decoding strategies that lead to high aggregate token-overlap scores despite succumbing to template collapse, in which models generate only repetitive, safe generic text and omit clinical terminology. Unaddressed, this blind spot can lead to metric gaming, where models that perform well on benchmarks prove clinically uninformative. Instead, we advocate for lexical diversity measures to check model generations for clinical specificity. We introduce Clinical Association Displacement (CAD), a vocabulary-level framework that quantifies shifts in demographic-based word associations in generated reports. Weighted…
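As a rough illustration of the kind of lexical diversity check the abstract argues for, the sketch below computes a distinct-n ratio (unique n-grams divided by total n-grams) over a set of generated reports. This is a generic, commonly used diversity measure, not the paper's CAD or WAE metric, whose definitions are not given in this listing. The function names, the whitespace tokenization, and the example reports are all assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-grams in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinct_n(reports: Iterable[str], n: int = 1) -> float:
    """Unique n-grams divided by total n-grams, pooled over all reports.

    Values near 0 suggest repetitive, template-like output;
    values near 1 suggest varied vocabulary.
    Tokenization here is naive lowercase whitespace splitting (an assumption).
    """
    counts: Counter = Counter()
    for report in reports:
        counts.update(ngrams(report.lower().split(), n))
    total = sum(counts.values())
    return len(counts) / total if total else 0.0


# Hypothetical generated reports, for illustration only.
collapsed = ["no acute cardiopulmonary abnormality"] * 3
varied = [
    "no acute cardiopulmonary abnormality",
    "small left pleural effusion with basilar atelectasis",
    "mild cardiomegaly without pulmonary edema",
]

print(f"collapsed distinct-1: {distinct_n(collapsed, 1):.3f}")
print(f"varied distinct-1:    {distinct_n(varied, 1):.3f}")
```

A set of reports that repeats one safe template scores low on this ratio even if each report overlaps well with typical reference text, which is the failure mode that token-overlap metrics alone would miss.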
Taxonomy
Topics: Radiology practices and education · Artificial Intelligence in Healthcare and Education · Radiomics and Machine Learning in Medical Imaging
