LLaVA-Ultra: Large Chinese Language and Vision Assistant for Ultrasound
Xuechen Guo, Wenhao Chai, Shi-Yan Li, Gaoang Wang

TL;DR
LLaVA-Ultra is a specialized multimodal model that combines Chinese language understanding with ultrasound image analysis, enabling accurate medical visual question answering through fine-grained, data-efficient training.
Contribution
The paper introduces a novel architecture with a fusion module and weighted scoring for medical images, along with a large-scale Chinese ultrasound dataset for effective fine-tuning.
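To make the two ideas in this contribution concrete, here is a minimal sketch of feature fusion followed by softmax-weighted scoring over several images from one exam. This summary does not specify the paper's actual design, so everything here is an assumption for illustration: the function names (`fuse_features`, `weighted_image_feature`), the mixing factor `alpha`, the use of a query vector, and the dot-product score are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_features(coarse, fine, alpha=0.5):
    """Hypothetical fusion: convex combination of two equal-length feature vectors.

    `coarse` and `fine` stand in for the outputs of two vision encoders;
    `alpha` (an assumed parameter) controls how much the fine-grained one counts.
    """
    if len(coarse) != len(fine):
        raise ValueError("feature vectors must have the same length")
    return [(1 - alpha) * c + alpha * f for c, f in zip(coarse, fine)]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def weighted_image_feature(image_features, query):
    """Score each image feature against `query`, normalize the scores with
    softmax, and return (weights, weighted average feature).

    Images that match the query poorly receive small weights, which is one
    simple way to down-weight redundant or irrelevant frames.
    """
    scores = [dot(f, query) for f in image_features]
    weights = softmax(scores)
    dim = len(query)
    pooled = [
        sum(w * f[i] for w, f in zip(weights, image_features))
        for i in range(dim)
    ]
    return weights, pooled


if __name__ == "__main__":
    # Toy inputs: two images, each with a coarse and a fine feature vector.
    coarse_feats = [[1.0, 0.0], [0.0, 1.0]]
    fine_feats = [[1.0, 0.0], [0.0, 1.0]]
    fused = [fuse_features(c, f) for c, f in zip(coarse_feats, fine_feats)]
    weights, pooled = weighted_image_feature(fused, query=[1.0, 0.0])
    print(weights, pooled)
```

In this toy setup the first image aligns with the query, so it receives the larger weight. A real system would operate on high-dimensional encoder outputs and learn the fusion and scoring parameters rather than fixing them by hand.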
Findings
Outperforms previous models on Med-VQA datasets
Demonstrates robustness in medical ultrasound scenarios
Achieves state-of-the-art accuracy in medical visual question answering
Abstract
Multimodal Large Language Models (MLLMs) have recently garnered attention as a prominent research focus. By harnessing powerful LLMs, they facilitate the transition of conversational generative AI from unimodal text to multimodal tasks. This boom has begun to significantly impact the medical field. However, general visual language models (VLMs) lack sophisticated comprehension for medical visual question answering (Med-VQA). Even models specifically tailored to the medical domain tend to produce vague answers with weak visual relevance. In this paper, we propose a fine-grained adaptive VLM architecture for Chinese medical visual conversations through parameter-efficient tuning. Specifically, we devise a fusion module with fine-grained vision encoders to enhance subtle medical visual semantics. We then note that data redundancy, which is common in medical scenes, is ignored in most prior works.…
Taxonomy
Topics: Multimodal Machine Learning Applications · COVID-19 diagnosis using AI · Artificial Intelligence Applications
Methods: Softmax · Attention Is All You Need · Knowledge Distillation
