TL;DR
This paper introduces a lightweight multimodal retrieval system for medical diagnosis that improves clinical classification and VQA tasks using general-purpose models, with analysis of retrieval errors.
Contribution
It presents a novel, lightweight multimodal retriever that is aware of the downstream large vision-language model (LVLM). The retriever is trained with minimal data and enhances retrieval-augmented diagnosis without extensive medical pre-training.
Findings
Retrieval optimization improves performance on inconsistent retrieval cases.
Lightweight fine-tuning achieves competitive results with less data.
Analysis reveals challenges in how LVLMs use retrieved information when making predictions.
Abstract
Retrieving visual and textual information from medical literature and hospital records can enhance diagnostic accuracy for clinical image interpretation. However, multimodal retrieval-augmented diagnosis is highly challenging. We explore a lightweight mechanism for enhancing diagnostic performance of retrieval-augmented LVLMs. We train a lightweight LVLM-aware multimodal retriever, such that the retriever learns to return images and texts that guide the LVLM toward correct predictions. In our low-resource setting, we perform only lightweight fine-tuning with small amounts of data, and use only general-purpose backbone models, achieving competitive results in clinical classification and VQA tasks compared to medically pre-trained models with extensive training. In a novel analysis, we highlight a previously unexplored class of errors that we term inconsistent retrieval predictions: cases…
