ClinKD: Cross-Modal Clinical Knowledge Distiller For Multi-Task Medical Images
Hongyu Ge, Longkun Hao, Zihui Xu, Zhenxin Lin, Bin Li, Shoujun Zhou, Hongjin Zhao, Yihang Liu

TL;DR
ClinKD is a framework that improves multimodal large language models (MLLMs) for medical visual question answering (Med-VQA) by strengthening image-text alignment and medical knowledge transfer; the authors report state-of-the-art results.
Contribution
Introduces ClinKD, a cross-modal knowledge distillation framework that addresses image-text misalignment and domain knowledge gaps in medical VQA tasks.
Findings
Achieves state-of-the-art performance on challenging Med-VQA datasets.
Significantly improves image-text alignment in medical multimodal models.
Enables better medical knowledge adaptation in large language models.
Abstract
Medical Visual Question Answering (Med-VQA) is a critical and challenging subtask within the general VQA domain. Despite significant advances in general VQA, multimodal large language models (MLLMs) still exhibit substantial limitations when handling multi-task VQA scenarios. These limitations manifest as erroneous spatial localization and misinterpretation of medical images, and they arise primarily from two fundamental issues: inadequate image-text alignment and insufficient domain-specific knowledge for medical applications. To address these issues, we introduce the Cross-Modal Clinical Knowledge Distiller (ClinKD), a framework designed to enhance image-text alignment and establish more effective medical knowledge transformation mechanisms, enabling MLLMs to perform better even when they lack prior medical knowledge. Our extensive experimental…
Taxonomy
Topics: AI in cancer detection
