Point, Detect, Count: Multi-Task Medical Image Understanding with Instruction-Tuned Vision-Language Models
Sushant Gautam, Michael A. Riegler, P{\aa}l Halvorsen

TL;DR
This paper explores fine-tuning instruction-tuned vision-language models for multi-task medical image understanding, demonstrating improved robustness and accuracy in detection, localization, and counting tasks, with potential for more explainable medical AI.
Contribution
It introduces a multi-task fine-tuning approach for VLMs on medical images using instruction prompts, showing enhanced performance and interpretability for clinical applications.
Findings
Multi-task training improves model robustness and accuracy.
Fine-tuning reduces counting MAE and increases matching accuracy.
Trade-offs include more zero-case predictions, affecting reliability.
Abstract
We investigate fine-tuning Vision-Language Models (VLMs) for multi-task medical image understanding, focusing on detection, localization, and counting of findings in medical images. Our objective is to evaluate whether instruction-tuned VLMs can simultaneously improve these tasks, with the goal of enhancing diagnostic accuracy and efficiency. Using MedMultiPoints, a multimodal dataset with annotations from endoscopy (polyps and instruments) and microscopy (sperm cells), we reformulate each task into instruction-based prompts suitable for vision-language reasoning. We fine-tune Qwen2.5-VL-7B-Instruct using Low-Rank Adaptation (LoRA) across multiple task combinations. Results show that multi-task training improves robustness and accuracy. For example, it reduces the Count Mean Absolute Error (MAE) and increases Matching Accuracy in the Counting + Pointing task. However, trade-offs emerge,…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
