TL;DR
RadJEPA is a self-supervised learning framework for chest X-ray encoders that predicts latent representations of masked image regions without language supervision, and is reported to outperform prior methods such as Rad-DINO.
Contribution
The paper applies latent-space prediction, via a Joint Embedding Predictive Architecture, to self-supervised pre-training on radiology images, removing the need for paired image-text data.
Findings
RadJEPA surpasses state-of-the-art methods such as Rad-DINO on multiple benchmarks.
The encoder is pre-trained solely on unlabeled chest X-ray images and transfers to several downstream tasks.
It demonstrates strong performance in disease classification, segmentation, and report generation.
Abstract
Recent advances in medical vision-language models guide the learning of visual representations; however, this form of supervision is constrained by the availability of paired image-text data, raising the question of whether robust radiology encoders can be learned without relying on language supervision. In this work, we introduce RadJEPA, a self-supervised framework built on a Joint Embedding Predictive Architecture that learns without language supervision. Pre-trained solely on unlabeled chest X-ray images, the model learns to predict latent representations of masked image regions. This predictive objective differs fundamentally from both image-text pre-training and DINO-style self-distillation: rather than aligning global representations across views or modalities, RadJEPA explicitly models latent-space prediction. We evaluate the learned encoder on disease classification, semantic…
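To make the latent-prediction objective described in the abstract concrete, here is a minimal NumPy sketch of one loss computation on masked patches. It is an illustration under assumptions, not the authors' implementation: the toy linear encoder, the mean-pooled context, the positional embeddings, the mask ratio, and the exponential-moving-average target update (borrowed from the general I-JEPA recipe) are all hypothetical choices, as are the names `encode`, `jepa_loss`, and `ema_update`.

```python
import numpy as np

# Illustrative sizes only; none of these values come from the paper.
NUM_PATCHES, PATCH_DIM, LATENT_DIM = 16, 32, 8


def encode(patches, weights):
    """Toy per-patch encoder (a stand-in for a vision transformer)."""
    return np.tanh(patches @ weights)


def jepa_loss(patches, mask, ctx_w, tgt_w, pred_w, pos_emb):
    """Mean squared error between predicted and target latents of masked patches."""
    # Targets: latents of the masked patches, computed by the target encoder.
    targets = encode(patches, tgt_w)[mask]
    # Context: the context encoder only sees the visible patches.
    context = encode(patches[~mask], ctx_w).mean(axis=0)
    # Predictor: combine the context summary with each masked position's embedding.
    predicted = (context + pos_emb[mask]) @ pred_w
    return float(np.mean((predicted - targets) ** 2))


def ema_update(tgt_w, ctx_w, momentum=0.99):
    """Assumed I-JEPA-style update: target weights slowly track the context weights."""
    return momentum * tgt_w + (1.0 - momentum) * ctx_w


rng = np.random.default_rng(0)
patches = rng.normal(size=(NUM_PATCHES, PATCH_DIM))  # stand-in for X-ray patches
mask = np.zeros(NUM_PATCHES, dtype=bool)
mask[rng.choice(NUM_PATCHES, size=6, replace=False)] = True  # hide 6 of 16 patches

ctx_w = rng.normal(scale=0.1, size=(PATCH_DIM, LATENT_DIM))
tgt_w = ctx_w.copy()
pred_w = rng.normal(scale=0.1, size=(LATENT_DIM, LATENT_DIM))
pos_emb = rng.normal(scale=0.1, size=(NUM_PATCHES, LATENT_DIM))

loss = jepa_loss(patches, mask, ctx_w, tgt_w, pred_w, pos_emb)
tgt_w = ema_update(tgt_w, ctx_w)
print(f"latent prediction loss: {loss:.4f}")
```

The point of the sketch is that the loss is computed between latent vectors, not pixels, and only at masked positions; no text, labels, or second augmented view enters the objective.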
Taxonomy
Topics: COVID-19 diagnosis using AI · Domain Adaptation and Few-Shot Learning · Artificial Intelligence in Healthcare and Education
