Evaluating the Diagnostic Classification Ability of Multimodal Large Language Models: Insights from the Osteoarthritis Initiative

Li Wang; Xi Chen; XiangWen Deng; HuaHui Yi; ZeKun Jiang; Kang Li; Jian Li

arXiv:2601.02443·cs.CV·January 7, 2026

TL;DR

This study assesses the diagnostic classification capabilities of multimodal large language models (MLLMs) in medical imaging and finds that a trained vision encoder alone can outperform full MLLM pipelines at classifying knee osteoarthritis from radiographs.

Contribution

The paper provides a systematic analysis of how each MLLM component contributes to medical image classification, highlighting the limited role of LLM fine-tuning and the importance of vision encoder optimization and data quality.

Findings

01. A trained vision encoder alone outperforms full MLLM pipelines in classification accuracy.

02. Fine-tuning the LLM yields no meaningful improvement over prompt-based guidance.

03. Balanced, high-quality datasets matter more than larger, imbalanced ones.
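
As an illustration of the third finding, one common way to build a class-balanced training set is to downsample every class to the size of the rarest one. The sketch below is a generic example under stated assumptions, not the authors' procedure: the function name `balance_by_downsampling`, the `(item, label)` sample format, and the toy data are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def balance_by_downsampling(samples, seed=0):
    """Downsample each class to the size of the rarest class.

    `samples` is a list of (item, label) pairs. This format and the
    function itself are illustrative assumptions, not from the paper.
    """
    # Group samples by their class label.
    by_label = defaultdict(list)
    for item, label in samples:
        by_label[label].append((item, label))

    # Every class is reduced to the size of the smallest class.
    target = min(len(group) for group in by_label.values())

    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], target))

    rng.shuffle(balanced)
    return balanced


# Toy, made-up data: three classes with 50, 20, and 5 samples.
toy = (
    [(f"img_a_{i}", 0) for i in range(50)]
    + [(f"img_b_{i}", 1) for i in range(20)]
    + [(f"img_c_{i}", 2) for i in range(5)]
)
balanced = balance_by_downsampling(toy)
counts = Counter(label for _, label in balanced)
print(counts)
```

Downsampling trades dataset size for balance, which is the trade-off the finding describes; whether the paper balanced its data this way or by another method (for example, reweighting or curated selection) is not stated in this summary.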

Abstract

Multimodal large language models (MLLMs) show promising performance on medical visual question answering (VQA) and report generation, but these generation and explanation abilities do not reliably transfer to disease-specific classification. We evaluated MLLM architectures on knee osteoarthritis (OA) radiograph classification, which remains underrepresented in existing medical MLLM benchmarks, even though knee OA affects an estimated 300 to 400 million people worldwide. Through systematic ablation studies manipulating the vision encoder, the connector, and the large language model (LLM) across diverse training strategies, we measured each component's contribution to diagnostic accuracy. In our classification task, a trained vision encoder alone could outperform full MLLM pipelines in classification accuracy, and fine-tuning the LLM provided no meaningful improvement over prompt-based…

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Artificial Intelligence in Healthcare and Education