Curriculum-Driven 3D CT Report Generation via Language-Free Visual Grafting and Zone-Constrained Compression
V. K. Cody Bumgardner, Mitchell A. Klusty, Mahmut S. Gokmen, and Evan W. Damron

TL;DR
This paper introduces Ker-VLJEPA-3B, a four-phase curriculum learning framework that generates free-text reports from 3D thoracic CT volumes. It grounds a Llama 3.2 3B decoder in visual features from a frozen, self-supervised visual backbone trained without text supervision, alongside new attention mechanisms.
Contribution
The paper proposes a curriculum training approach that couples a self-supervised visual encoder with a language model for 3D CT report generation. The encoder is pretrained on unlabeled CTs without text supervision, and vision-language alignment is deferred to the later curriculum phases.
Findings
Achieves a state-of-the-art macro F1 score of 0.429 on the CT-RATE benchmark.
Outperforms previous methods by 3.6% in macro F1 score.
56.6% of generation quality is attributed to patient-specific visual content.
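The findings above are reported in macro F1, which averages per-label F1 scores so that rare abnormalities count as much as common ones. The following is a minimal sketch of that metric on multi-label data, not the paper's evaluation code. The toy labels, the function names, and the assumption that findings are already extracted from reports as binary vectors are all hypothetical.

```python
from typing import List

def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def label_counts(true_rows: List[List[int]], pred_rows: List[List[int]], j: int):
    """Count true positives, false positives, false negatives for label j."""
    tp = sum(1 for t, p in zip(true_rows, pred_rows) if t[j] == 1 and p[j] == 1)
    fp = sum(1 for t, p in zip(true_rows, pred_rows) if t[j] == 0 and p[j] == 1)
    fn = sum(1 for t, p in zip(true_rows, pred_rows) if t[j] == 1 and p[j] == 0)
    return tp, fp, fn

def macro_f1(true_rows: List[List[int]], pred_rows: List[List[int]]) -> float:
    """Unweighted mean of per-label F1 scores."""
    n_labels = len(true_rows[0])
    scores = [f1_from_counts(*label_counts(true_rows, pred_rows, j))
              for j in range(n_labels)]
    return sum(scores) / n_labels

def micro_f1(true_rows: List[List[int]], pred_rows: List[List[int]]) -> float:
    """F1 over counts pooled across all labels (dominated by frequent labels)."""
    n_labels = len(true_rows[0])
    totals = [label_counts(true_rows, pred_rows, j) for j in range(n_labels)]
    tp, fp, fn = (sum(c[i] for c in totals) for i in range(3))
    return f1_from_counts(tp, fp, fn)

# Hypothetical toy data: 4 reports, 3 findings; finding 0 is common,
# findings 1 and 2 are rare and are both missed by the predictions.
true_rows = [[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 0, 0]]
pred_rows = [[1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 0, 1]]

print(f"macro F1 = {macro_f1(true_rows, pred_rows):.3f}")  # 0.333
print(f"micro F1 = {micro_f1(true_rows, pred_rows):.3f}")  # 0.727
```

In this toy case the model is perfect on the common finding and wrong on both rare ones, so micro F1 stays high while macro F1 drops to one third. This is why macro F1 is a demanding metric under the severe class imbalance the abstract describes.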
Abstract
Automated radiology report generation from 3D computed tomography (CT) volumes is challenging due to extreme sequence lengths, severe class imbalance, and the tendency of large language models (LLMs) to ignore visual tokens in favor of linguistic priors. We present Ker-VLJEPA-3B, a four-phase curriculum learning framework for free-text report generation from thoracic CT volumes. A phased training curriculum progressively adapts a Llama 3.2 3B decoder to ground its output in visual features from a frozen, self-supervised encoder. Our visual backbone (LeJEPA ViT-Large) is trained via self-supervised joint-embedding prediction on unlabeled CTs, without text supervision. Unlike contrastive models (CLIP, BiomedCLIP), this language-free backbone yields modality-pure representations. Vision-language alignment is deferred to the curriculum's bridge and generation phases. This modality-agnostic…
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Artificial Intelligence in Healthcare and Education
