VIVID-Med: LLM-Supervised Structured Pretraining for Deployable Medical ViTs
Xiyao Wang, Xiaoyu Tan, Yang Dai, Yuxuan Fu, Shuo Li, Xihe Qiu

TL;DR
VIVID-Med pretrains medical vision transformers under the supervision of a frozen LLM that acts as a structured semantic teacher. The LLM is discarded after training, leaving a lightweight, deployable ViT that reports strong results across multiple medical imaging tasks.
Contribution
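The paper proposes structured semantic supervision from a frozen LLM, encoding clinical findings as verifiable JSON field-state pairs under a Unified Medical Schema (UMS). It pairs this with Structured Prediction Decomposition (SPD), a training scheme that partitions cross-attention into orthogonality-regularized query groups, to improve medical ViT pretraining and deployment efficiency.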
Findings
Achieves 0.8588 macro-AUC on CheXpert, outperforming BiomedCLIP by 6.65 points.
Demonstrates strong zero-shot transfer to NIH ChestX-ray14 with 0.7225 macro-AUC.
Excels in cross-modality tasks, with 0.8413 AUC on LIDC-IDRI and 0.9969 on OrganAMNIST.
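For readers unfamiliar with the metric behind the figures above, macro-AUC is the unweighted mean of the per-finding AUCs. The following is a minimal sketch in plain Python, not the paper's evaluation code; the function names and the toy labels and scores are illustrative assumptions.

```python
def binary_auc(labels, scores):
    """AUC for one finding: the fraction of (positive, negative) pairs
    in which the positive case scores higher; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def macro_auc(label_matrix, score_matrix):
    """Unweighted mean of per-finding AUCs.
    Rows are studies, columns are findings."""
    n_findings = len(label_matrix[0])
    aucs = []
    for c in range(n_findings):
        labels = [row[c] for row in label_matrix]
        scores = [row[c] for row in score_matrix]
        aucs.append(binary_auc(labels, scores))
    return sum(aucs) / n_findings


# Toy data (hypothetical): 4 studies, 2 findings.
toy_labels = [[1, 0], [0, 1], [1, 1], [0, 0]]
toy_scores = [[0.9, 0.2], [0.1, 0.8], [0.4, 0.3], [0.6, 0.5]]
toy_macro_auc = macro_auc(toy_labels, toy_scores)
```

Because every finding contributes equally to the mean, rare findings weigh as much as common ones, which is why macro-AUC is a common choice for multi-label chest X-ray benchmarks.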
Abstract
Vision-language pretraining has driven significant progress in medical image analysis. However, current methods typically supervise visual encoders using one-hot labels or free-form text, neither of which effectively captures the complex semantic relationships among clinical findings. In this study, we introduce VIVID-Med, a novel framework that leverages a frozen large language model (LLM) as a structured semantic teacher to pretrain medical vision transformers (ViTs). VIVID-Med translates clinical findings into verifiable JSON field-state pairs via a Unified Medical Schema (UMS), utilizing answerability-aware masking to focus optimization. It then employs Structured Prediction Decomposition (SPD) to partition cross-attention into orthogonality-regularized query groups, extracting complementary visual aspects. Crucially, the LLM is discarded post-training, yielding a lightweight,…
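Two ideas named in the abstract, answerability-aware masking and orthogonality regularization across query groups, can be illustrated with a small sketch. The abstract does not give the paper's actual loss functions, so everything below is an assumption: the schema field names, the cross-entropy form of the field loss, and the squared-cosine form of the penalty are all hypothetical stand-ins.

```python
import math

# Hypothetical field-state pairs for one study; these names are
# illustrative and are not taken from the paper's Unified Medical Schema.
example_record = {"pleural_effusion": "present", "cardiomegaly": "absent"}


def masked_field_loss(logits, targets, answerable):
    """Mean cross-entropy over answerable fields only (assumed form).
    logits: per-field lists of state logits
    targets: index of the correct state for each field
    answerable: False where the field cannot be determined and is skipped
    """
    total, count = 0.0, 0
    for field_logits, target, ok in zip(logits, targets, answerable):
        if not ok:
            continue  # unanswerable fields contribute no gradient signal
        m = max(field_logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(v - m) for v in field_logits))
        total += log_z - field_logits[target]
        count += 1
    return total / count if count else 0.0


def orthogonality_penalty(group_vectors):
    """Sum of squared cosine similarities between distinct query-group
    vectors (assumed form); zero when all groups are mutually orthogonal."""
    def unit(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    units = [unit(v) for v in group_vectors]
    penalty = 0.0
    for i in range(len(units)):
        for j in range(i + 1, len(units)):
            cos = sum(a * b for a, b in zip(units[i], units[j]))
            penalty += cos * cos
    return penalty
```

In this sketch the second field in a call such as `masked_field_loss([[0.0, 0.0], [5.0, -5.0]], [0, 1], [True, False])` is ignored entirely, so the loss depends only on the first field. A training objective would plausibly add the penalty to the field loss with a weighting coefficient, but that coefficient and the exact combination are not stated in the abstract.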
Taxonomy
Topics: COVID-19 diagnosis using AI · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications
