LLaVA-Med: Training a Large Language-and-Vision Assistant for   Biomedicine in One Day

Chunyuan Li; Cliff Wong; Sheng Zhang; Naoto Usuyama; Haotian Liu,; Jianwei Yang; Tristan Naumann; Hoifung Poon; Jianfeng Gao

arXiv:2306.00890·cs.CV·June 2, 2023·223 cites

LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day

Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu,, Jianwei Yang, Tristan Naumann, Hoifung Poon, Jianfeng Gao

PDF

Open Access 1 Repo 10 Models 1 Datasets

TL;DR

LLaVA-Med is a cost-efficient, large-scale vision-language model trained in less than a day to assist biomedical research through multimodal conversation and question answering.

Contribution

The paper introduces a novel curriculum learning approach and a large biomedical figure-caption dataset to train a biomedical vision-language assistant rapidly and effectively.

Findings

01

LLaVA-Med outperforms previous models on biomedical VQA datasets.

02

The training process takes less than 15 hours using eight A100 GPUs.

03

The model demonstrates strong multimodal conversational capabilities.

Abstract

Conversational generative AI has demonstrated remarkable promise for empowering biomedical practitioners, but current investigations focus on unimodal text. Multimodal conversational AI has seen rapid progress by leveraging billions of image-text pairs from the public web, but such general-domain vision-language models still lack sophistication in understanding and conversing about biomedical images. In this paper, we propose a cost-efficient approach for training a vision-language conversational assistant that can answer open-ended research questions of biomedical images. The key idea is to leverage a large-scale, broad-coverage biomedical figure-caption dataset extracted from PubMed Central, use GPT-4 to self-instruct open-ended instruction-following data from the captions, and then fine-tune a large general-domain vision-language model using a novel curriculum learning method.…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

microsoft/LLaVA-Med
pytorch

Models

Datasets

Kafoo/therascribe-gold-1M-with-images
dataset· 22 dl
22 dl

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsMultimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques

MethodsMulti-Head Attention · Attention Is All You Need · Linear Layer · Adam · Residual Connection · Dense Connections · Dropout · Byte Pair Encoding · Softmax · Layer Normalization