VIVID-Med: LLM-Supervised Structured Pretraining for Deployable Medical ViTs

Xiyao Wang; Xiaoyu Tan; Yang Dai; Yuxuan Fu; Shuo Li; Xihe Qiu

arXiv:2603.09109 · cs.CV · March 12, 2026

TL;DR

VIVID-Med is an LLM-supervised pretraining framework for medical vision transformers. A frozen LLM acts as a structured semantic teacher during training and is discarded afterward, leaving a lightweight ViT that reaches 0.8588 macro-AUC on CheXpert and transfers to other chest X-ray datasets and imaging modalities.

Contribution

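The paper supervises ViT pretraining with verifiable JSON field-state pairs defined by a Unified Medical Schema (UMS), using a frozen LLM as the teacher, and adds Structured Prediction Decomposition (SPD), which partitions cross-attention into orthogonality-regularized query groups. Because the LLM is dropped after training, only the ViT needs to be deployed.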

Findings

01

Achieves 0.8588 macro-AUC on CheXpert, outperforming BiomedCLIP by 6.65 points.

02

Demonstrates strong zero-shot transfer to NIH ChestX-ray14 with 0.7225 macro-AUC.

03

Excels in cross-modality tasks, with 0.8413 AUC on LIDC-IDRI and 0.9969 on OrganAMNIST.
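
For readers unfamiliar with the metric, macro-AUC is the unweighted mean of the per-finding AUCs. The sketch below computes it in plain Python on made-up labels and scores. The function names and toy data are illustrative assumptions, not the paper's evaluation code, and the paper's exact protocol (label handling, excluded classes) is not described on this page.

```python
def binary_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Returns None when a class has
    only positives or only negatives, since AUC is undefined there.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def macro_auc(label_matrix, score_matrix):
    """Unweighted mean of per-finding AUCs (rows: images, columns: findings)."""
    n_findings = len(label_matrix[0])
    aucs = []
    for c in range(n_findings):
        auc = binary_auc([row[c] for row in label_matrix],
                         [row[c] for row in score_matrix])
        if auc is not None:  # skip findings where AUC is undefined
            aucs.append(auc)
    return sum(aucs) / len(aucs)


# Toy data (hypothetical): 4 images, 2 findings.
labels = [[1, 0], [0, 1], [1, 1], [0, 0]]
scores = [[0.9, 0.2], [0.3, 0.8], [0.6, 0.4], [0.4, 0.5]]
result = macro_auc(labels, scores)  # finding 0 -> 1.0, finding 1 -> 0.75
```

The pairwise loop is quadratic in the number of images, which is fine for illustration; a rank-based implementation would be the usual choice at dataset scale.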

Abstract

Vision-language pretraining has driven significant progress in medical image analysis. However, current methods typically supervise visual encoders using one-hot labels or free-form text, neither of which effectively captures the complex semantic relationships among clinical findings. In this study, we introduce VIVID-Med, a novel framework that leverages a frozen large language model (LLM) as a structured semantic teacher to pretrain medical vision transformers (ViTs). VIVID-Med translates clinical findings into verifiable JSON field-state pairs via a Unified Medical Schema (UMS), utilizing answerability-aware masking to focus optimization. It then employs Structured Prediction Decomposition (SPD) to partition cross-attention into orthogonality-regularized query groups, extracting complementary visual aspects. Crucially, the LLM is discarded post-training, yielding a lightweight,…
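
The two training ideas named in the abstract can be illustrated with a minimal sketch. It assumes that answerability-aware masking means excluding fields whose state cannot be determined from the loss, and that the orthogonality regularizer penalizes similarity between query-group representations. Both readings, along with the field names, state vocabulary, and function names, are assumptions made for illustration; the truncated abstract does not give the actual loss or regularizer.

```python
import math


def masked_field_loss(target, predicted_probs):
    """Mean cross-entropy over answerable fields only.

    target: field name -> state name, or None when the field is
            unanswerable (hypothetical encoding of the mask).
    predicted_probs: field name -> {state name: probability}.
    """
    losses = []
    for field, state in target.items():
        if state is None:
            continue  # masked out: contributes nothing to the loss
        losses.append(-math.log(predicted_probs[field][state]))
    return sum(losses) / len(losses) if losses else 0.0


def orthogonality_penalty(group_vectors):
    """Sum of squared cosine similarities between distinct query groups.

    Zero when all group vectors are mutually orthogonal; grows as
    groups become redundant.
    """
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    penalty = 0.0
    for i in range(len(group_vectors)):
        for j in range(i + 1, len(group_vectors)):
            penalty += cosine(group_vectors[i], group_vectors[j]) ** 2
    return penalty


# Hypothetical field-state pairs for one image.
target = {"cardiomegaly": "present", "pleural_effusion": "absent", "pneumothorax": None}
probs = {
    "cardiomegaly": {"present": 0.5, "absent": 0.5},
    "pleural_effusion": {"present": 0.5, "absent": 0.5},
    "pneumothorax": {"present": 0.01, "absent": 0.99},  # ignored by the mask
}
loss = masked_field_loss(target, probs)
redundant = orthogonality_penalty([[1.0, 0.0], [1.0, 0.0]])
complementary = orthogonality_penalty([[1.0, 0.0], [0.0, 1.0]])
```

In this toy case the unanswerable field has no effect on the loss regardless of its predicted probabilities, and the penalty is higher for the redundant pair of groups than for the complementary pair.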

Taxonomy

Topics: COVID-19 diagnosis using AI · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications