LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day
Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu,, Jianwei Yang, Tristan Naumann, Hoifung Poon, Jianfeng Gao

TL;DR
LLaVA-Med is a cost-efficient, large-scale vision-language model trained in less than a day to assist biomedical research through multimodal conversation and question answering.
Contribution
The paper introduces a novel curriculum learning approach and a large biomedical figure-caption dataset to train a biomedical vision-language assistant rapidly and effectively.
Findings
LLaVA-Med outperforms previous models on biomedical VQA datasets.
The training process takes less than 15 hours using eight A100 GPUs.
The model demonstrates strong multimodal conversational capabilities.
Abstract
Conversational generative AI has demonstrated remarkable promise for empowering biomedical practitioners, but current investigations focus on unimodal text. Multimodal conversational AI has seen rapid progress by leveraging billions of image-text pairs from the public web, but such general-domain vision-language models still lack sophistication in understanding and conversing about biomedical images. In this paper, we propose a cost-efficient approach for training a vision-language conversational assistant that can answer open-ended research questions of biomedical images. The key idea is to leverage a large-scale, broad-coverage biomedical figure-caption dataset extracted from PubMed Central, use GPT-4 to self-instruct open-ended instruction-following data from the captions, and then fine-tune a large general-domain vision-language model using a novel curriculum learning method.…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
- 🤗microsoft/llava-med-7b-deltamodel· 245 dl· ♡ 71245 dl♡ 71
- 🤗katielink/llava-med-7b-pathvqa-deltamodel· 6 dl· ♡ 16 dl♡ 1
- 🤗katielink/llava-med-7b-vqarad-deltamodel· 13 dl· ♡ 513 dl♡ 5
- 🤗katielink/llava-med-7b-slake-deltamodel· ♡ 1♡ 1
- 🤗saurabh-straive/llava_100k_finetunedmodel
- 🤗Straive/llava-1.5-13b-lora-100k-8-marmodel
- 🤗saurabh-straive/llava-1-5model
- 🤗GDinesh/llava-1-5model
- 🤗microsoft/llava-med-v1.5-mistral-7bmodel· 12k dl· ♡ 12012k dl♡ 120
- 🤗cifope/llava-med-tesseract-v1.5-mistral-7bmodel· 1 dl1 dl
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
MethodsMulti-Head Attention · Attention Is All You Need · Linear Layer · Adam · Residual Connection · Dense Connections · Dropout · Byte Pair Encoding · Softmax · Layer Normalization
