SemiHVision: Enhancing Medical Multimodal Models with a Semi-Human Annotated Dataset and Fine-Tuned Instruction Generation

Junda Wang; Yujan Ting; Eric Z. Chen; Hieu Tran; Hong Yu; Weijing Huang; Terrence Chen

arXiv:2410.14948 · cs.CL · October 22, 2024

TL;DR

This paper introduces SemiHVision, a new medical multimodal dataset, and fine-tunes a model that significantly improves diagnostic reasoning and performance in real-world clinical tasks, bridging the gap between research and practice.

Contribution

It presents SemiHVision, a semi-human annotated dataset, and a fine-tuned model PMC-Cambrian-8B-AN that outperforms existing models on medical benchmarks and introduces a new clinical evaluation benchmark.

Findings

01. PMC-Cambrian-8B-AN surpasses public and private medical models on traditional benchmarks.

02. The model achieves state-of-the-art results on the JAMA Clinical Challenge benchmark.

03. Traditional benchmarks do not fully reflect real-world clinical task performance.

Abstract

Multimodal large language models (MLLMs) have made significant strides, yet they face challenges in the medical domain due to limited specialized knowledge. While recent medical MLLMs demonstrate strong performance in lab settings, they often struggle in real-world applications, highlighting a substantial gap between research and practice. In this paper, we seek to address this gap at various stages of the end-to-end learning pipeline, including data collection, model fine-tuning, and evaluation. At the data collection stage, we introduce SemiHVision, a dataset that combines human annotations with automated augmentation techniques to improve both medical knowledge representation and diagnostic reasoning. For model fine-tuning, we trained PMC-Cambrian-8B-AN over 2400 H100 GPU hours, resulting in performance that surpasses public medical models like HuatuoGPT-Vision-34B (79.0% vs. 66.7%)…

Code & Models

Repositories

believewhat/SemiHVision
none · Official

Datasets

akemiH/SemiHVision
dataset · 40 dl

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Residual Connection · Position-Wise Feed-Forward Layer · Dense Connections · Softmax · Multi-Head Attention · Adam · Dropout