Point, Detect, Count: Multi-Task Medical Image Understanding with Instruction-Tuned Vision-Language Models

Sushant Gautam; Michael A. Riegler; Pål Halvorsen

arXiv:2505.16647 · cs.CV · September 3, 2025

TL;DR

This paper explores fine-tuning instruction-tuned vision-language models for multi-task medical image understanding, demonstrating improved robustness and accuracy in detection, localization, and counting tasks, with potential for more explainable medical AI.

Contribution

It introduces a multi-task fine-tuning approach for VLMs on medical images using instruction prompts, showing enhanced performance and interpretability for clinical applications.
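As a rough illustration of what casting several tasks as instruction prompts can look like, the sketch below turns one annotated image into three prompt/answer pairs. The prompt wording, the annotation field names (`image`, `label`, `boxes`), the JSON answer formats, and the use of box centers as points are all assumptions made for this example; they are not taken from the paper.

```python
import json

# Hypothetical prompt templates; the wording is an assumption, not the paper's.
TEMPLATES = {
    "count": 'How many {label} are visible in the image? Answer as {{"count": N}}.',
    "point": "Point to every {label} in the image. Answer as a JSON list of [x, y] points.",
    "detect": "Detect every {label} in the image. Answer as a JSON list of [x1, y1, x2, y2] boxes.",
}


def build_examples(annotation, tasks=("count", "point", "detect")):
    """Turn one annotated image into one instruction example per task.

    `annotation` is a hypothetical record with keys `image`, `label`, and
    `boxes` (a list of [x1, y1, x2, y2] boxes).
    """
    label = annotation["label"]
    boxes = annotation["boxes"]
    # Assumption: a point annotation is approximated by the center of each box.
    points = [[(x1 + x2) / 2, (y1 + y2) / 2] for x1, y1, x2, y2 in boxes]
    answers = {
        "count": {"count": len(boxes)},
        "point": points,
        "detect": boxes,
    }
    return [
        {
            "image": annotation["image"],
            "prompt": TEMPLATES[task].format(label=label),
            "answer": json.dumps(answers[task]),
        }
        for task in tasks
    ]
```

Passing a subset such as `tasks=("count", "point")` would yield examples for a single task combination, which is one plausible way to assemble different multi-task training mixes.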

Findings

01. Multi-task training improves model robustness and accuracy.

02. Fine-tuning reduces counting MAE and increases matching accuracy.

03. Trade-offs include more zero-case predictions, affecting reliability.

Abstract

We investigate fine-tuning Vision-Language Models (VLMs) for multi-task medical image understanding, focusing on detection, localization, and counting of findings in medical images. Our objective is to evaluate whether instruction-tuned VLMs can simultaneously improve these tasks, with the goal of enhancing diagnostic accuracy and efficiency. Using MedMultiPoints, a multimodal dataset with annotations from endoscopy (polyps and instruments) and microscopy (sperm cells), we reformulate each task into instruction-based prompts suitable for vision-language reasoning. We fine-tune Qwen2.5-VL-7B-Instruct using Low-Rank Adaptation (LoRA) across multiple task combinations. Results show that multi-task training improves robustness and accuracy. For example, it reduces the Count Mean Absolute Error (MAE) and increases Matching Accuracy in the Counting + Pointing task. However, trade-offs emerge,…
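For readers unfamiliar with the Count MAE metric named in the abstract, here is a minimal sketch using the standard definition of mean absolute error over per-image counts. The function name and the list-based inputs are assumptions for illustration; the paper's own evaluation code may differ.

```python
def count_mae(predicted_counts, true_counts):
    """Mean absolute error between predicted and ground-truth object counts.

    Both arguments are equal-length sequences with one count per image.
    """
    if len(predicted_counts) != len(true_counts):
        raise ValueError("predicted_counts and true_counts must have equal length")
    if not predicted_counts:
        raise ValueError("at least one image is required")
    total_error = sum(abs(p - t) for p, t in zip(predicted_counts, true_counts))
    return total_error / len(predicted_counts)
```

For example, `count_mae([3, 0, 5], [2, 1, 5])` averages the absolute errors 1, 1, and 0, giving 2/3. A lower value means the predicted counts are closer to the annotated counts.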

Code & Models

Repositories

simula/pointdetectcount
PyTorch · Official

Models

SimulaMet/PointDetectCount-Qwen2.5-VL-7B-LoRA
model · 3 downloads

Datasets

SimulaMet/MedMultiPoints
dataset · 109 downloads
