ExGra-Med: Extended Context Graph Alignment for Medical Vision-Language Models

Duy M. H. Nguyen; Nghiem T. Diep; Trung Q. Nguyen; Hoang-Bao Le; Tai Nguyen; Tien Nguyen; TrungTin Nguyen; Nhat Ho; Pengtao Xie; Roger Wattenhofer; James Zou; Daniel Sonntag; Mathias Niepert

arXiv:2410.02615 · cs.LG · November 10, 2025

TL;DR

ExGra-Med introduces a multi-graph alignment framework for medical vision-language models that improves semantic grounding and cross-modal coherence, matching LLaVA-Med's performance with only 10% of the pre-training data.

Contribution

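The paper presents a novel multi-graph alignment method that jointly aligns images, instruction responses, and extended captions in the latent space, together with an efficient end-to-end training scheme based on black-box gradient estimation that scales to large LLMs (e.g., LLaMA-7B). Together these strengthen vision-language alignment and reduce pre-training data requirements.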

Findings

01. Matches LLaVA-Med performance with only 10% of pre-training data

02. Achieves a 20.13% gain on VQA-RAD

03. Outperforms BioMedGPT and RadFM on key tasks

Abstract

State-of-the-art medical multi-modal LLMs (med-MLLMs), such as LLaVA-Med and BioMedGPT, primarily depend on scaling model size and data volume, with training driven largely by autoregressive objectives. However, we reveal that this approach can lead to weak vision-language alignment, making these models overly dependent on costly instruction-following data. To address this, we introduce ExGra-Med, a novel multi-graph alignment framework that jointly aligns images, instruction responses, and extended captions in the latent space, advancing semantic grounding and cross-modal coherence. To scale to large LLMs (e.g., LLaMA-7B), we develop an efficient end-to-end training scheme using black-box gradient estimation, enabling fast and scalable optimization. Empirically, ExGra-Med matches LLaVA-Med's performance using just 10% of the pre-training data, achieving a 20.13% gain on VQA-RAD and…

Videos

ExGra-Med: Extended Context Graph Alignment for Medical Vision-Language Models · SlidesLive

Taxonomy

Topics: Image Retrieval and Classification Techniques · Biomedical Text Mining and Ontologies · Multimodal Machine Learning Applications

Methods: LLaMA · Focus