MEDMKG: Benchmarking Medical Knowledge Exploitation with Multimodal Knowledge Graph

Xiaochen Wang; Yuan Zhong; Lingwei Zhang; Lisong Dai; Ting Wang; Fenglong Ma

arXiv:2505.17214·cs.AI·May 26, 2025

TL;DR

MEDMKG introduces a comprehensive multimodal medical knowledge graph combining imaging and textual data, enhancing medical AI performance and providing a new resource for multimodal knowledge integration.

Contribution

We propose MEDMKG, the first large-scale multimodal medical knowledge graph linking imaging and clinical concepts, with a novel filtering algorithm and extensive benchmarking.

Findings

01

Improves downstream medical task performance

02

Provides a robust foundation for multimodal knowledge integration

03

Benchmarking of 24 methods on 6 datasets

Abstract

Medical deep learning models depend heavily on domain-specific knowledge to perform well on knowledge-intensive clinical tasks. Prior work has primarily leveraged unimodal knowledge graphs, such as the Unified Medical Language System (UMLS), to enhance model performance. However, integrating multimodal medical knowledge graphs remains largely underexplored, mainly due to the lack of resources linking imaging data with clinical concepts. To address this gap, we propose MEDMKG, a Medical Multimodal Knowledge Graph that unifies visual and textual medical information through a multi-stage construction pipeline. MEDMKG fuses the rich multimodal data from MIMIC-CXR with the structured clinical knowledge from UMLS, utilizing both rule-based tools and large language models for accurate concept extraction and relationship modeling. To ensure graph quality and compactness, we introduce…

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 2

Strengths

1. This paper addresses an important gap between unimodal medical knowledge graphs and vision–language models in healthcare. Radiograph KGs are relatively rare, so I feel this work represents a notable contribution to the field.
2. The proposed NaF method is a simple but effective heuristic that balances connectivity and distinctiveness when selecting representative images.
3. The dataset is comprehensively evaluated, and the authors provide a thoughtful discussion of performance trends across

Weaknesses

1. The results in Tables 2, 3, and 4 have no error bounds and thus lack statistical power. I suggest the authors add error bounds: some numbers in the tables are quite close, and we cannot tell whether they differ significantly.
2. The paper uses GPT-4o for cross-modality relation extraction. I am unsure whether this introduces bias. The authors could add a small-scale ablation study to check the agreement between GPT-4o and other LLMs.
3. In qualitative analysis, the paper

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. Elegant filtering method: NaF's TF-IDF-like log rarity term, $\log(M / |N(r,c)|)$, elegantly balances redundancy reduction against rarity-based selection of high-value images. Combining it with greedy concept coverage ensures the resulting graph remains both compact and concept-rich. This is well suited to redundancy-heavy medical imaging corpora. It is also agnostic to image type and domain.
2. The link prediction table (Table 2) compares 17 KGE models across head / relation / tail predicti
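The rarity term and greedy coverage mentioned in this review can be illustrated with a minimal sketch. It is not the authors' implementation: the listing does not give the full NaF formula, so the code assumes that M is the total number of images, that N(r,c) is the set of images linked to concept c through relation r, that an image's score is the sum of its log rarity terms, and that images are kept greedily only while they cover a new (relation, concept) pair. The function names `rarity_scores` and `greedy_cover` are hypothetical.

```python
import math
from collections import defaultdict


def rarity_scores(edges):
    """Score images by summed log rarity of their (relation, concept) links.

    edges: iterable of (image_id, relation, concept) triples.
    Assumption: M is the number of distinct images, and N(r, c) is the
    set of images attached to concept c via relation r.
    """
    edges = set(edges)
    m = len({img for img, _, _ in edges})
    neighbors = defaultdict(set)  # (relation, concept) -> images, i.e. N(r, c)
    for img, rel, con in edges:
        neighbors[(rel, con)].add(img)
    scores = defaultdict(float)
    for img, rel, con in edges:
        # TF-IDF-like term: links shared by few images contribute more.
        scores[img] += math.log(m / len(neighbors[(rel, con)]))
    return dict(scores), neighbors


def greedy_cover(edges):
    """Keep high-scoring images until every (relation, concept) pair is covered."""
    scores, neighbors = rarity_scores(edges)
    covered, kept = set(), []
    # Highest score first; ties broken by image id for determinism.
    for img in sorted(scores, key=lambda i: (-scores[i], i)):
        new_pairs = {k for k, imgs in neighbors.items() if img in imgs} - covered
        if new_pairs:  # skip images that add no new coverage
            kept.append(img)
            covered |= new_pairs
    return kept
```

Under these assumptions, an image whose links are all shared by every other image scores zero and is dropped once a rarer image already covers the same pairs, which is the redundancy-reduction behavior the review describes.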

Weaknesses

1. Limited Multimodality (CXR-Only): Despite claims of general multimodal capability, all experiments use only chest X-rays. This confines multimodality to “text + chest images + ontology,” not multiple image types. Consequently, the claim of extending UMLS to multimodal space is overstated. The graph does not yet demonstrate coverage of other imaging modalities (CT, MRI, ultrasound, etc.), although base UMLS does.
2. Overstated Benchmark Diversity: The claim of “six diverse datasets” and “five t

Reviewer 03 · Rating 2 · Confidence 5

Strengths

The work makes a clear attempt to build a structured resource that connects imaging data with clinical concepts and to organize that resource in a way that can be consumed by standard models. The paper also provides benchmark tasks (link prediction, retrieval, VQA) and reports baseline performance numbers on them, which makes it easier for future work to compare under similar settings.

Weaknesses

Overall, while the paper presents MEDMKG as a broadly useful medical multimodal knowledge graph and reports promising downstream results, there are several issues with the scope of the resource, the reliability of its construction, and the way its impact is evaluated.
W1: The paper presents MEDMKG as a Medical Multimodal Knowledge Graph that can support knowledge-intensive clinical tasks and unify visual and textual medical knowledge. In practice, the resource is almost entirely limited to chest

Reviewer 04 · Rating 2 · Confidence 3

Strengths

1. Comprehensive baseline coverage for link prediction tasks.
2. Integrating structured clinical knowledge with medical imaging is quite novel in the clinical ML domain.

Weaknesses

W1. The paper claims that multimodal KGs outperform text-only KGs, but no direct comparison is provided. Without a UMLS-only baseline, the improvement cannot be attributed to multimodality itself, leaving the core hypothesis untested.
W2. The proposed Neighbor-aware Filtering (NaF) is a handcrafted scoring rule combining neighbor count and distinctiveness. It is not compared to simpler alternatives such as random sampling, degree-based filtering, or embedding-based diversity selection. Without

Reviewer 05 · Rating 6 · Confidence 3

Strengths

1. It breaks the limitations of traditional unimodal medical knowledge graphs by innovatively integrating MIMIC-CXR imaging data with UMLS clinical knowledge to construct MEDMKG, the first multimodal knowledge graph tailored for medical scenarios. The proposed NaF algorithm offers a new approach to redundancy filtering in multimodal knowledge graphs by balancing image connectivity and uniqueness.
2. The construction process is rigorous, combining MetaMap and GPT-4o to achieve high-precision conc

Weaknesses

1. The core parameters of the NaF algorithm (e.g., how M in the formula is determined) are not explained, and no parameter sensitivity analysis is conducted to verify the impact of different parameter settings on graph quality and downstream task performance. Furthermore, there is no comparison with other mainstream knowledge graph filtering algorithms, which makes it difficult to establish the advantages of NaF.
2. Experimental data mainly rely on MIMIC-CXR (chest X-ray images) and UMLS, w
