CLIPSyntel: CLIP and LLM Synergy for Multimodal Question Summarization in Healthcare
Akash Ghosh, Arkadeep Acharya, Raghav Jain, Sriparna Saha, Aman Chadha, Setu Sinha

TL;DR
This paper introduces a multimodal framework combining CLIP and LLMs to generate medical question summaries that incorporate visual information, improving understanding and decision-making in healthcare.
Contribution
It presents the MMQS dataset pairing medical queries with visual aids and a novel multimodal summarization framework utilizing CLIP and LLMs for enhanced medical query understanding.
Findings
Visual cues improve summary quality
Multimodal approach enhances medical understanding
Framework outperforms text-only methods
Abstract
In the era of modern healthcare, swiftly generating medical question summaries is crucial for informed and timely patient care. Despite the increasing complexity and volume of medical data, existing studies have focused solely on text-based summarization, neglecting the integration of visual information. Recognizing the untapped potential of combining textual queries with visual representations of medical conditions, we introduce the Multimodal Medical Question Summarization (MMQS) Dataset. This dataset, a major contribution of our work, pairs medical queries with visual aids, facilitating a richer and more nuanced understanding of patient needs. We also propose a framework, utilizing the power of Contrastive Language Image Pretraining (CLIP) and Large Language Models (LLMs), consisting of four modules that identify medical disorders, generate relevant context, filter medical concepts,…
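The abstract names CLIP as the component used to identify medical disorders from the visual input. As a rough illustration of how CLIP-style zero-shot matching generally works (not the authors' implementation), the sketch below ranks candidate disorder labels by cosine similarity between an image embedding and text-label embeddings, then applies a softmax. The vectors, the label names, the `rank_disorders` helper, and the `scale` value are all assumptions made for illustration; a real system would obtain the embeddings from CLIP's image and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rank_disorders(image_emb, label_embs, scale=100.0):
    """Rank candidate labels for one image, most likely first.

    `scale` is an assumed temperature-like factor that sharpens the
    softmax; it is not a value taken from the paper.
    """
    labels = list(label_embs)
    logits = [scale * cosine(image_emb, label_embs[name]) for name in labels]
    probs = softmax(logits)
    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional vectors standing in for real encoder outputs (hypothetical).
image_emb = [0.9, 0.1, 0.2]
label_embs = {
    "skin rash": [1.0, 0.0, 0.1],
    "swollen eye": [0.0, 1.0, 0.0],
    "mouth ulcer": [0.1, 0.2, 1.0],
}

ranking = rank_disorders(image_emb, label_embs)
print(ranking[0][0])  # top-ranked label for the toy image
```

In a pipeline like the one the abstract outlines, the top-ranked label would presumably be passed on to the later modules as extra context for the summarizer; how that hand-off is done is not specified in the text above.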
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Text and Document Classification Technologies
Methods: Contrastive Language-Image Pre-training · Attentive Walk-Aggregating Graph Neural Network
