TL;DR
This paper introduces MedITok, a unified medical image tokenizer trained on extensive multimodal data, enabling advanced autoregressive synthesis and understanding across diverse medical imaging modalities.
Contribution
It presents a novel two-stage training framework that leverages unpaired images and paired image-text data to build a versatile medical image tokenizer.
Findings
Achieves state-of-the-art results on over 30 benchmarks across 9 modalities.
Utilizes over 33 million images and 2 million image-text pairs for training.
Enables autoregressive modeling for diagnostic and generative medical applications.
Abstract
Autoregressive modeling has driven major advances in multimodal AI, yet its application to medical imaging remains constrained by the absence of a unified image tokenizer that simultaneously preserves fine-grained anatomical structures and rich clinical semantics across heterogeneous modalities. Existing approaches jointly optimize image reconstruction and textual semantic objectives; they rely on large-scale image-caption pairs and are prone to gradient interference. This is ill-suited to the medical domain, where paired data are scarce and abundant unpaired images remain unexploited. This work identifies these issues in building unified medical image tokenizers and introduces a principled two-stage training framework that uses visual representation as a bridge to address them. The proposed visual representation alignment stage enables the utilization of large-scale unpaired medical images…
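The staged idea in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the loss forms (mean squared error for reconstruction, cosine distance for alignment), the function names, the `weight` parameter, and the assumption that the second stage aligns to a caption embedding while still reconstructing are all hypothetical choices made for illustration. The point it shows is only which data each stage would need: the first stage uses an image and a visual feature derived from it, so unpaired images suffice, while the second stage requires a paired caption.

```python
import math

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def cosine_distance(a, b):
    """1 - cosine similarity; 0.0 when the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def stage1_loss(image, reconstruction, token_feat, visual_feat, weight=1.0):
    """Hypothetical stage-1 objective: reconstruction plus alignment to a
    visual feature of the same image. No caption is needed."""
    return mse(image, reconstruction) + weight * cosine_distance(token_feat, visual_feat)

def stage2_loss(image, reconstruction, token_feat, text_feat, weight=1.0):
    """Hypothetical stage-2 objective: reconstruction plus alignment to a
    caption embedding. Requires a paired image-text sample."""
    return mse(image, reconstruction) + weight * cosine_distance(token_feat, text_feat)

# Toy values (made up): a perfectly reconstructed 4-pixel "image" and 2-d features.
image = [0.0, 1.0, 0.5, 0.25]
reconstruction = [0.0, 1.0, 0.5, 0.25]
token_feat = [1.0, 0.0]
visual_feat = [2.0, 0.0]   # same direction as token_feat
text_feat = [0.0, 3.0]     # orthogonal to token_feat

loss_s1 = stage1_loss(image, reconstruction, token_feat, visual_feat)
loss_s2 = stage2_loss(image, reconstruction, token_feat, text_feat)
print(loss_s1, loss_s2)
```

In this toy setting the stage-1 loss is zero because the token feature already points in the same direction as the visual feature, while the stage-2 loss is nonzero because the token feature is orthogonal to the caption embedding. Running the stages one after the other, rather than summing both alignment terms in one objective, is the kind of separation the abstract motivates as a way to avoid gradient interference.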
