Towards a Multimodal Large Language Model with Pixel-Level Insight for Biomedicine

Xiaoshuang Huang; Lingdong Shen; Jia Liu; Fangxin Shang; Hongxiang Li; Haifeng Huang; Yehui Yang

arXiv:2412.09278 · cs.CV · October 9, 2025

TL;DR

MedPLIB is a biomedical multimodal large language model with pixel-level understanding. It supports visual question answering, pixel-level prompts, and pixel-level grounding, and it reports state-of-the-art results on medical vision-language tasks.

Contribution

The paper introduces three things: MedPLIB, a biomedical multimodal LLM with pixel-level insight; a multi-stage Mixture-of-Experts (MoE) training strategy; and a new complex medical VQA dataset.

Findings

01 Achieves state-of-the-art results on medical vision-language tasks.

02 Outperforms existing models in zero-shot pixel grounding evaluations.

03 Demonstrates effective multitask learning with the MoE strategy.
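To make the MoE idea in the findings concrete, here is a minimal, generic sketch of a soft-gated mixture of two experts. It is not MedPLIB's routing or training code: the names `moe_forward`, `vision_language_expert`, and `pixel_grounding_expert`, the toy expert functions, and the linear gate are all assumptions made for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def softmax(scores: Sequence[float]) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(
    x: Vector,
    gate_weights: Sequence[Vector],
    experts: Sequence[Callable[[Vector], Vector]],
) -> Tuple[Vector, Vector]:
    """Mix expert outputs with softmax gate probabilities.

    gate_weights holds one weight vector per expert (hypothetical linear gate).
    Returns the mixed output and the gate probabilities.
    """
    # One gate score per expert: dot product of its gate vector with the input.
    scores = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in gate_weights]
    probs = softmax(scores)
    outputs = [expert(x) for expert in experts]
    # Probability-weighted sum of the expert outputs, element by element.
    mixed = [
        sum(p * out[i] for p, out in zip(probs, outputs))
        for i in range(len(outputs[0]))
    ]
    return mixed, probs


# Toy stand-ins for the two experts (hypothetical; real experts are networks).
def vision_language_expert(x: Vector) -> Vector:
    return [2.0 * v for v in x]


def pixel_grounding_expert(x: Vector) -> Vector:
    return [v + 1.0 for v in x]


if __name__ == "__main__":
    experts = [vision_language_expert, pixel_grounding_expert]
    gate = [[1.0, 0.0], [0.0, 1.0]]
    mixed, probs = moe_forward([1.0, 0.0], gate, experts)
    print(mixed, probs)
```

With an all-zero gate, both experts receive equal weight, so the output is the plain average of the two expert outputs. A learned gate shifts that weight toward whichever expert suits the input.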

Abstract

In recent years, Multimodal Large Language Models (MLLMs) have achieved notable advancements, demonstrating the feasibility of developing an intelligent biomedical assistant. However, current biomedical MLLMs predominantly focus on image-level understanding and restrict interactions to textual commands, thus limiting their capability boundaries and the flexibility of usage. In this paper, we introduce a novel end-to-end multimodal large language model for the biomedical domain, named MedPLIB, which possesses pixel-level understanding. Excitingly, it supports visual question answering (VQA), arbitrary pixel-level prompts (points, bounding boxes, and free-form shapes), and pixel-level grounding. We propose a novel Mixture-of-Experts (MoE) multi-stage training strategy, which divides MoE into separate training phases for a visual-language expert model and a pixel-grounding expert model,…
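The three pixel-level prompt types named in the abstract (points, bounding boxes, and free-form shapes) can all be expressed as binary masks over the image grid. The sketch below shows one way to do that; it is an illustration under stated assumptions, not MedPLIB's prompt encoder. The function names, the inclusive box convention, and the choice to represent a free-form shape as a polygon are all hypothetical.

```python
from typing import List, Sequence, Tuple

Mask = List[List[int]]   # mask[y][x] is 1 inside the prompt region, else 0
Point = Tuple[int, int]  # (x, y) pixel coordinates


def empty_mask(width: int, height: int) -> Mask:
    return [[0] * width for _ in range(height)]


def point_mask(width: int, height: int, point: Point, radius: int = 0) -> Mask:
    """Mark every pixel within `radius` of the clicked point."""
    px, py = point
    mask = empty_mask(width, height)
    for y in range(height):
        for x in range(width):
            if (x - px) ** 2 + (y - py) ** 2 <= radius ** 2:
                mask[y][x] = 1
    return mask


def box_mask(width: int, height: int, box: Tuple[int, int, int, int]) -> Mask:
    """Mark pixels inside an inclusive (x0, y0, x1, y1) box, clipped to the image."""
    x0, y0, x1, y1 = box
    mask = empty_mask(width, height)
    for y in range(max(0, y0), min(height - 1, y1) + 1):
        for x in range(max(0, x0), min(width - 1, x1) + 1):
            mask[y][x] = 1
    return mask


def polygon_mask(width: int, height: int, vertices: Sequence[Point]) -> Mask:
    """Mark pixels whose centers lie inside the polygon (even-odd ray casting)."""
    mask = empty_mask(width, height)
    n = len(vertices)
    for y in range(height):
        for x in range(width):
            cx, cy = x + 0.5, y + 0.5  # test the pixel center
            inside = False
            for i in range(n):
                xa, ya = vertices[i]
                xb, yb = vertices[(i + 1) % n]
                # Does this edge cross the horizontal line through the center?
                if (ya > cy) != (yb > cy):
                    x_cross = xa + (cy - ya) * (xb - xa) / (yb - ya)
                    if cx < x_cross:
                        inside = not inside
            mask[y][x] = 1 if inside else 0
    return mask


if __name__ == "__main__":
    for row in polygon_mask(4, 4, [(0, 0), (4, 0), (0, 4)]):
        print(row)
```

Reducing every prompt to a mask gives a single input format regardless of how the user marked the region, which is why a sketch like this is a convenient mental model for region-level prompting.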

Code & Models

Repositories

shawnhuang497/medplib (PyTorch, official)

Taxonomy

Topics: Topic Modeling

Methods: Mixture of Experts · Focus