TL;DR
SPECTRE is a scalable, fully transformer-based foundation model for volumetric CT that leverages self-supervised and vision-language pretraining to learn general-purpose, clinically meaningful representations from openly available datasets.
Contribution
The paper introduces SPECTRE, a novel 3D transformer architecture with joint local and global modeling, trained exclusively on open data, achieving state-of-the-art results in CT representation learning.
Findings
SPECTRE outperforms previous models on multiple CT benchmarks.
Pretraining with self-distillation and vision-language alignment improves the clinical relevance of the learned representations.
The model is effective in both zero-shot and fine-tuned scenarios.
Abstract
We introduce SPECTRE, a fully transformer-based foundation model for volumetric computed tomography (CT). Our Self-Supervised & Cross-Modal Pretraining for CT Representation Extraction (SPECTRE) approach utilizes scalable 3D Vision Transformer architectures and modern self-supervised and vision-language pretraining strategies to learn general-purpose CT representations. Volumetric CT poses unique challenges, such as extreme token scaling, geometric anisotropy, and weak or noisy clinical supervision, that make standard transformer and contrastive learning recipes ineffective out of the box. The framework jointly optimizes a local transformer for high-resolution volumetric feature extraction and a global transformer for whole-scan context modeling, making large-scale 3D attention computationally tractable. Notably, SPECTRE is trained exclusively on openly available CT datasets,…
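To give a rough sense of why splitting attention into a local and a global stage makes 3D attention tractable, the following back-of-the-envelope sketch counts query-key pairs for full-volume attention versus a two-stage scheme. It is an illustration only, not the paper's implementation: the volume, crop, and patch sizes are hypothetical, and it assumes each local crop is summarized by a single token before the global stage, which the abstract does not specify.

```python
from math import ceil


def num_tokens(volume_shape, patch_shape):
    """Count non-overlapping 3D patches needed to cover a volume."""
    n = 1
    for v, p in zip(volume_shape, patch_shape):
        n *= ceil(v / p)
    return n


def attention_pairs_full(volume_shape, patch_shape):
    """Query-key pairs when every patch token attends to every other."""
    n = num_tokens(volume_shape, patch_shape)
    return n * n


def attention_pairs_local_global(volume_shape, crop_shape, patch_shape):
    """Query-key pairs for a two-stage scheme (assumed, simplified).

    Stage 1 (local): full attention inside each crop independently.
    Stage 2 (global): attention across one summary token per crop.
    """
    crops = num_tokens(volume_shape, crop_shape)
    tokens_per_crop = num_tokens(crop_shape, patch_shape)
    local_pairs = crops * tokens_per_crop ** 2
    global_pairs = crops ** 2
    return local_pairs + global_pairs


# Hypothetical sizes in voxels, ordered (depth, height, width).
# The anisotropic patch loosely mirrors coarser through-plane spacing.
VOLUME = (256, 512, 512)
CROP = (64, 128, 128)
PATCH = (8, 16, 16)

full = attention_pairs_full(VOLUME, PATCH)
split = attention_pairs_local_global(VOLUME, CROP, PATCH)
print(f"tokens in full volume: {num_tokens(VOLUME, PATCH)}")
print(f"full attention pairs:  {full}")
print(f"local + global pairs:  {split}")
print(f"reduction factor:      {full / split:.1f}x")
```

With these assumed sizes the full volume yields 32,768 tokens, and the two-stage count is roughly 64 times smaller than full attention. The exact savings depend on the crop size and on how local features are pooled into the global stage.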
