MedMax: Mixed-Modal Instruction Tuning for Training Biomedical Assistants
Hritik Bansal, Daniel Israel, Siyan Zhao, Shufan Li, Tung Nguyen, Aditya Grover

TL;DR
MedMax introduces a large-scale, diverse multimodal dataset for biomedical instruction tuning, significantly enhancing the performance of foundation models in biomedical visual question answering and related tasks.
Contribution
We created MedMax, a comprehensive multimodal biomedical dataset, and demonstrated its effectiveness in fine-tuning models for improved biomedical AI assistance.
Findings
26% performance improvement over Chameleon
18.3% improvement over GPT-4o in biomedical VQA
Diverse tasks across biomedical domains
Abstract
Recent advancements in mixed-modal generative models have opened new avenues for developing unified biomedical assistants capable of analyzing biomedical images, answering complex questions about them, and generating multimodal patient reports. However, existing datasets face challenges such as small sizes, limited coverage of biomedical tasks and domains, and a reliance on narrow sources. To address these gaps, we present MedMax, a large-scale multimodal biomedical instruction-tuning dataset for mixed-modal foundation models. With 1.47 million instances, MedMax encompasses a diverse range of tasks, including interleaved image-text generation, biomedical image captioning and generation, visual chat, and report understanding. These tasks span knowledge across diverse biomedical domains, including radiology and histopathology, grounded in medical papers and YouTube videos. Subsequently, we…
Taxonomy
Topics: Assistive Technology in Communication and Mobility
