Visual Prompt Engineering for Vision Language Models in Radiology

Stefan Denner; Markus Bujotzek; Dimitrios Bounias; David Zimmerer; Raphael Stock; Klaus Maier-Hein

arXiv:2408.15802 · cs.CV · June 24, 2025 · 2 cites

Open Access

TL;DR

This paper introduces a method to improve zero-shot radiology image classification by embedding visual cues like arrows and bounding boxes into images, which guides models to focus on relevant regions and enhances interpretability and accuracy.

Contribution

The study demonstrates that incorporating visual markers into radiological images improves zero-shot classification performance (AUROC gains of up to 0.185) and makes the model's focus easier to interpret.

Findings

01. Visual markers increase AUROC by up to 0.185.

02. Attention maps show models focus on clinically relevant areas.

03. Method is validated across four public chest X-ray datasets.
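
To make the AUROC figure in the first finding concrete, the following is a minimal sketch of how AUROC and a difference between two scoring setups could be computed. The labels and scores are made-up toy values, not results from the paper, and the names `auroc`, `scores_plain`, and `scores_marked` are hypothetical.

```python
def auroc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data (illustrative only): 1 = finding present, 0 = no finding.
labels = [1, 1, 1, 0, 0, 0]
scores_plain = [0.6, 0.4, 0.7, 0.5, 0.3, 0.65]   # hypothetical scores, unmarked images
scores_marked = [0.8, 0.6, 0.9, 0.5, 0.3, 0.65]  # hypothetical scores, marked images

# Difference in AUROC between the two hypothetical setups.
delta = auroc(labels, scores_marked) - auroc(labels, scores_plain)
print(f"AUROC change on toy data: {delta:+.3f}")
```

An AUROC difference reported this way is an absolute change on the 0 to 1 scale, which is how a figure such as 0.185 would typically be read.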

Abstract

Medical image classification plays a crucial role in clinical decision-making, yet most models are constrained to a fixed set of predefined classes, limiting their adaptability to new conditions. Contrastive Language-Image Pretraining (CLIP) offers a promising solution by enabling zero-shot classification through multimodal large-scale pretraining. However, while CLIP effectively captures global image content, radiology requires a more localized focus on specific pathology regions to enhance both interpretability and diagnostic accuracy. To address this, we explore the potential of incorporating visual cues into zero-shot classification, embedding visual markers, such as arrows, bounding boxes, and circles, directly into radiological images to guide model attention. Evaluating across four public chest X-ray datasets, we demonstrate that visual markers improve AUROC by up to 0.185,…
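
The two ideas in the abstract, drawing a marker into the image and scoring it zero-shot against class prompts, can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the image is a toy 2D grid, the embeddings are hand-written stand-ins for what a CLIP-style image and text encoder would produce, and `draw_box`, `zero_shot_probs`, and the temperature value are assumptions chosen for the example.

```python
import math


def draw_box(image, top, left, bottom, right, value=255):
    """Return a copy of a 2D grayscale image with a 1-pixel box outline."""
    out = [row[:] for row in image]  # copy so the original stays unmarked
    for x in range(left, right + 1):
        out[top][x] = value
        out[bottom][x] = value
    for y in range(top, bottom + 1):
        out[y][left] = value
        out[y][right] = value
    return out


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_probs(image_emb, text_embs, temperature=0.07):
    """Softmax over temperature-scaled cosine similarities to each prompt."""
    logits = [cosine(image_emb, t) / temperature for t in text_embs]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Step 1: embed a visual marker (here a bounding box) into a toy 8x8 image.
image = [[0] * 8 for _ in range(8)]
marked = draw_box(image, top=2, left=3, bottom=5, right=6)

# Step 2: in a real pipeline the marked image would go through an image
# encoder; these vectors are hypothetical stand-ins for encoder outputs.
text_embs = [
    [1.0, 0.0, 0.0],  # stand-in for a prompt such as "finding present"
    [0.0, 1.0, 0.0],  # stand-in for a prompt such as "no finding"
]
image_emb = [0.9, 0.3, 0.1]

probs = zero_shot_probs(image_emb, text_embs)
print(probs)
```

Because the class set is defined only by the text prompts, new conditions can be added by writing new prompts rather than retraining, which is the adaptability the abstract attributes to zero-shot classification.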

Taxonomy

Topics: Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques · Semantic Web and Ontologies

Methods: Softmax · Attention Is All You Need · Focus · Contrastive Language-Image Pre-training