Advancing High Resolution Vision-Language Models in Biomedicine
Zekai Chen, Arda Pekis, Kevin Brown

TL;DR
This paper introduces a new biomedical vision-language model with a specialized dataset and hierarchical image encoding, achieving state-of-the-art zero-shot performance in biomedical visual question answering.
Contribution
It presents a novel biomedical instruction dataset, a hierarchical image encoding strategy, and the Llama3-Med model with improved zero-shot accuracy.
Findings
Improved zero-shot performance on biomedical VQA benchmarks by more than 10% on average.
Built a new instruction dataset enriched with medical image-text pairs from Claude3-Opus and LLaMA3 70B.
Enhanced fine-grained visual understanding with hierarchical image representations.
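To make the idea of hierarchical image representations more concrete, here is a minimal sketch of one common way to build them: the image is cut into progressively finer grids, and every tile is resized to the fixed input size of a vision encoder. This summary does not describe the paper's actual encoding scheme, so the grid levels (1, 2, 4), the 224-pixel tile size, the nearest-neighbour resizing, and the function names `resize_nearest` and `hierarchical_tiles` are all assumptions chosen for illustration.

```python
import numpy as np


def resize_nearest(img, size):
    """Resize an (H, W, C) array to (size, size, C) by nearest-neighbour sampling."""
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size  # source row index for each output row
    cols = np.arange(size) * w // size  # source column index for each output column
    return img[rows[:, None], cols[None, :]]


def hierarchical_tiles(img, levels=(1, 2, 4), tile_size=224):
    """Return one array of tiles per level (hypothetical helper, not the paper's method).

    Level g splits the image into a g x g grid. Each tile is resized to
    tile_size x tile_size, so coarse levels keep global context while
    fine levels keep local detail at higher effective resolution.
    """
    h, w = img.shape[:2]
    pyramid = []
    for g in levels:
        row_edges = np.linspace(0, h, g + 1).astype(int)
        col_edges = np.linspace(0, w, g + 1).astype(int)
        tiles = [
            resize_nearest(
                img[row_edges[i]:row_edges[i + 1], col_edges[j]:col_edges[j + 1]],
                tile_size,
            )
            for i in range(g)
            for j in range(g)
        ]
        pyramid.append(np.stack(tiles))  # shape: (g * g, tile_size, tile_size, C)
    return pyramid


if __name__ == "__main__":
    image = np.random.rand(896, 896, 3)  # stand-in for a high-resolution scan
    for level in hierarchical_tiles(image):
        print(level.shape)
```

In a full pipeline, each tile would be passed through a vision encoder and the resulting features combined before reaching the language model; how that combination is done is left open here because the summary does not specify it.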
Abstract
Multi-modal learning has significantly advanced generative AI, especially in vision-language modeling. Innovations like GPT-4V and open-source projects such as LLaVA have enabled robust conversational agents capable of zero-shot task completions. However, applying these technologies in the biomedical field presents unique challenges. Recent initiatives like LLaVA-Med have started to adapt instruction-tuning for biomedical contexts using large datasets such as PMC-15M. Our research offers three key contributions: (i) we present a new instruct dataset enriched with medical image-text pairs from Claude3-Opus and LLaMA3 70B, (ii) we propose a novel image encoding strategy using hierarchical representations to improve fine-grained biomedical visual comprehension, and (iii) we develop the Llama3-Med model, which achieves state-of-the-art zero-shot performance on biomedical visual question…
Taxonomy
Topics: Biomedical Text Mining and Ontologies
