SemiHVision: Enhancing Medical Multimodal Models with a Semi-Human Annotated Dataset and Fine-Tuned Instruction Generation
Junda Wang, Yujan Ting, Eric Z. Chen, Hieu Tran, Hong Yu, Weijing Huang, Terrence Chen

TL;DR
This paper introduces SemiHVision, a new medical multimodal dataset, and fine-tunes a model that significantly improves diagnostic reasoning and performance in real-world clinical tasks, bridging the gap between research and practice.
Contribution
The paper makes three contributions: SemiHVision, a semi-human annotated dataset; PMC-Cambrian-8B-AN, a fine-tuned model that outperforms existing models on medical benchmarks; and a new clinical evaluation benchmark.
Findings
PMC-Cambrian-8B-AN surpasses public and private medical models on traditional benchmarks.
The model achieves state-of-the-art results on the JAMA Clinical Challenge benchmark.
Traditional benchmarks do not fully reflect real-world clinical task performance.
Abstract
Multimodal large language models (MLLMs) have made significant strides, yet they face challenges in the medical domain due to limited specialized knowledge. While recent medical MLLMs demonstrate strong performance in lab settings, they often struggle in real-world applications, highlighting a substantial gap between research and practice. In this paper, we seek to address this gap at various stages of the end-to-end learning pipeline, including data collection, model fine-tuning, and evaluation. At the data collection stage, we introduce SemiHVision, a dataset that combines human annotations with automated augmentation techniques to improve both medical knowledge representation and diagnostic reasoning. For model fine-tuning, we trained PMC-Cambrian-8B-AN over 2400 H100 GPU hours, resulting in performance that surpasses public medical models like HuatuoGPT-Vision-34B (79.0% vs. 66.7%)…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Residual Connection · Position-Wise Feed-Forward Layer · Dense Connections · Softmax · Multi-Head Attention · Adam · Dropout
