CheXmix: Unified Generative Pretraining for Vision Language Models in Medical Imaging

Ashwin Kumar; Robbie Holland; Corey Barrett; Jangwon Kim; Maya Varma; Zhihong Chen; Yunhe Gao; Greg Zaharchuk; Tara Taghavi; Krishnaram Kenthapadi; Akshay Chaudhari

arXiv:2604.22989·cs.CV·April 28, 2026

TL;DR

CheXmix is a unified early-fusion generative model for chest X-rays that processes image and text tokens in a single sequence and improves performance on classification and report generation tasks.

Contribution

It extends Chameleon's autoregressive framework with a two-stage multimodal generative pretraining strategy that combines masked autoencoding with generative modeling to improve medical image understanding.

Findings

01. Outperforms existing models by 6.0% on AUROC at high masking ratios.

02. Increases image inpainting capability by 51.0%.

03. Outperforms CheXagent by 8.6% on AUROC and 45% on the GREEN metric.

Abstract

Recent medical multimodal foundation models are built as multimodal LLMs (MLLMs) by connecting a CLIP-pretrained vision encoder to an LLM using LLaVA-style finetuning. This two-stage, decoupled approach introduces a projection layer that can distort visual features. This is especially concerning in medical imaging where subtle cues are essential for accurate diagnoses. In contrast, early-fusion generative approaches such as Chameleon eliminate the projection bottleneck by processing image and text tokens within a single unified sequence, enabling joint representation learning that leverages the inductive priors of language models. We present CheXmix, a unified early-fusion generative model trained on a large corpus of chest X-rays paired with radiology reports. We expand on Chameleon's autoregressive framework by introducing a two-stage multimodal generative pretraining strategy that…
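The early-fusion idea described in the abstract (image and text tokens in one unified sequence, with no projection layer between a separate vision encoder and the LLM) can be illustrated with a minimal sketch. This is not the CheXmix implementation: the vocabulary sizes, the `BOI`/`EOI` marker tokens, and the helper names `fuse` and `next_token_pairs` are all hypothetical, and the sketch assumes the image has already been converted to discrete codebook ids by some image tokenizer.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration (not taken from the paper).
TEXT_VOCAB = 1000                 # text token ids occupy [0, TEXT_VOCAB)
IMAGE_VOCAB = 512                 # raw image codebook ids occupy [0, IMAGE_VOCAB)
BOI = TEXT_VOCAB + IMAGE_VOCAB    # assumed begin-of-image marker id
EOI = BOI + 1                     # assumed end-of-image marker id


def fuse(image_codes, text_ids):
    """Place discrete image codes and text ids in one shared-vocabulary sequence."""
    image_codes = np.asarray(image_codes, dtype=np.int64)
    text_ids = np.asarray(text_ids, dtype=np.int64)
    if not np.all((image_codes >= 0) & (image_codes < IMAGE_VOCAB)):
        raise ValueError("image code outside the image codebook range")
    if not np.all((text_ids >= 0) & (text_ids < TEXT_VOCAB)):
        raise ValueError("text id outside the text vocabulary range")
    # Shift image codes past the text range so both modalities share one id space.
    shifted = image_codes + TEXT_VOCAB
    return np.concatenate([[BOI], shifted, [EOI], text_ids]).astype(np.int64)


def next_token_pairs(seq):
    """Return (inputs, targets) for next-token prediction over the fused sequence."""
    return seq[:-1], seq[1:]


seq = fuse(image_codes=[3, 7], text_ids=[10, 11])
inputs, targets = next_token_pairs(seq)
```

Because both modalities live in one id space, a single autoregressive model can be trained on `inputs` and `targets` without a modality-specific projection, which is the property the abstract contrasts with LLaVA-style designs.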

Code & Models

Repositories

StanfordMIMI/CheXmix (GitHub)
