ZALM3: Zero-Shot Enhancement of Vision-Language Alignment via In-Context Information in Multi-Turn Multimodal Medical Dialogue
Zhangpu Li, Changhong Zou, Suxue Ma, Zhicheng Yang, Chen Du, Youbao Tang, Zhenjie Cao, Ning Zhang, Jui-Hsin Lai, Ruei-Sung Lin, Yuan Ni, Xingzhi Sun, Jing Xiao, Jieke Hou, Kai Zhang, Mei Han

TL;DR
ZALM3 is a zero-shot approach that uses in-context information from multi-turn medical dialogues to enhance vision-language alignment for medical images, especially poor-quality ones taken by patients, with the aim of improving diagnostic accuracy.
Contribution
The paper proposes ZALM3, a novel zero-shot method that uses dialogue context to improve vision-language alignment in multi-turn multimodal medical dialogue, addressing the poor quality of patient-taken images.
Findings
Significant improvement in vision-language alignment with ZALM3.
Effective noise reduction in patient-taken medical images.
Statistically significant performance gains across clinical departments.
Abstract
The rocketing prosperity of large language models (LLMs) in recent years has boosted the prevalence of vision-language models (VLMs) in the medical sector. In our online medical consultation scenario, a doctor responds to the texts and images provided by a patient in multiple rounds to diagnose her/his health condition, forming a multi-turn multimodal medical dialogue format. Unlike high-quality images captured by professional equipment in traditional medical visual question answering (Med-VQA), the images in our case are taken by patients' mobile phones. These images have poor quality control, with issues such as excessive background elements and the lesion area being significantly off-center, leading to degradation of vision-language alignment in the model training phase. In this paper, we propose ZALM3, a Zero-shot strategy to improve vision-language ALignment in Multi-turn…
Taxonomy
Topics: Speech and dialogue systems · Multimodal Machine Learning Applications
