TL;DR
CheXmix is a unified early-fusion generative model for medical imaging that improves performance on classification and report generation tasks by integrating visual and textual data effectively.
Contribution
CheXmix introduces a two-stage multimodal pretraining strategy that combines masked autoencoders with generative models to improve medical image understanding.
Findings
CheXmix outperforms existing models by 6.0% on AUROC at high masking ratios.
It increases image inpainting capability by 51.0%.
It outperforms CheXagent by 8.6% on AUROC and by 45% on the GREEN metric.
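For readers unfamiliar with the classification metric quoted above, AUROC can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it by direct pairwise comparison. It is a generic illustration with made-up labels and scores, not the evaluation code behind the reported numbers, and the summary does not say whether the reported gains are absolute points or relative improvements.

```python
def auroc(labels, scores):
    """AUROC via pairwise comparison (Mann-Whitney U); ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = finding present, 0 = finding absent.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auroc(labels, scores))  # 3 of 4 positive/negative pairs are ordered correctly
```

The quadratic loop is fine for a toy example; for real evaluation sets a sort-based implementation from an established library is the usual choice.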
Abstract
Recent medical multimodal foundation models are built as multimodal LLMs (MLLMs) by connecting a CLIP-pretrained vision encoder to an LLM using LLaVA-style finetuning. This two-stage, decoupled approach introduces a projection layer that can distort visual features. This is especially concerning in medical imaging where subtle cues are essential for accurate diagnoses. In contrast, early-fusion generative approaches such as Chameleon eliminate the projection bottleneck by processing image and text tokens within a single unified sequence, enabling joint representation learning that leverages the inductive priors of language models. We present CheXmix, a unified early-fusion generative model trained on a large corpus of chest X-rays paired with radiology reports. We expand on Chameleon's autoregressive framework by introducing a two-stage multimodal generative pretraining strategy that…
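The early-fusion idea described in the abstract, image and text tokens sharing one sequence, can be sketched as follows. This is a minimal illustration under stated assumptions, not CheXmix's implementation: it assumes the image has already been converted to discrete codebook indices by some image tokenizer (as Chameleon-style models do), and the vocabulary sizes, the BOI/EOI marker ids, and the function name build_sequence are all hypothetical.

```python
# Hypothetical sizes; the summary does not state the real ones.
TEXT_VOCAB_SIZE = 32000      # ids 0 .. 31999 are text tokens
IMAGE_CODEBOOK_SIZE = 8192   # image codes are shifted to ids 32000 .. 40191
BOI = TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE  # hypothetical begin-of-image marker
EOI = BOI + 1                                # hypothetical end-of-image marker


def build_sequence(image_codes, text_ids):
    """Place image codes and text ids in one shared id space and one sequence."""
    for c in image_codes:
        if not 0 <= c < IMAGE_CODEBOOK_SIZE:
            raise ValueError(f"image code out of range: {c}")
    for t in text_ids:
        if not 0 <= t < TEXT_VOCAB_SIZE:
            raise ValueError(f"text id out of range: {t}")
    # Offset image codes so they never collide with text ids.
    image_tokens = [TEXT_VOCAB_SIZE + c for c in image_codes]
    return [BOI] + image_tokens + [EOI] + list(text_ids)


# Toy inputs (hypothetical): two image codes followed by two report tokens.
print(build_sequence([0, 5], [10, 11]))
```

Because every token lives in the same id space, a single autoregressive model can attend across both modalities directly, with no separate projection layer between a vision encoder and the language model.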
