TL;DR
LoVT is a pre-training approach that combines chest X-ray images with their paired radiology reports to improve localized medical imaging tasks such as segmentation and detection. It outperforms the compared pre-training methods on 10 of the 18 localized tasks evaluated.
Contribution
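The paper introduces what is, to the best of the authors' knowledge, the first text-supervised pre-training method designed specifically for localized medical imaging tasks. It combines instance-level image-report contrastive learning with local contrastive learning on image region and report sentence representations.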
Findings
LoVT outperforms other methods on 10 of 18 localized tasks.
It demonstrates significant improvements in segmentation and detection accuracy.
The approach is effective across multiple chest X-ray datasets.
Abstract
Contrastive learning has proven effective for pre-training image models on unlabeled data, with promising results for tasks such as medical image classification. Using paired text (like radiological reports) during pre-training improves the results even further. Still, most existing methods target image classification downstream tasks and may not be optimal for localized tasks like semantic segmentation or object detection. We therefore propose Localized representation learning from Vision and Text (LoVT), to the best of our knowledge the first text-supervised pre-training method that targets localized medical imaging tasks. Our method combines instance-level image-report contrastive learning with local contrastive learning on image region and report sentence representations. We evaluate LoVT and commonly used pre-training methods on an evaluation framework of 18 localized tasks on chest…
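To make the instance-level image-report contrastive objective mentioned in the abstract more concrete, here is a minimal sketch of a generic symmetric InfoNCE loss in NumPy. It is not taken from the paper: the function names, the temperature of 0.1, and the random toy embeddings are all assumptions for illustration, and LoVT's actual instance-level and local losses may differ in their details.

```python
import numpy as np

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def l2_normalize(x):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def symmetric_info_nce(image_emb, text_emb, temperature=0.1):
    """Generic symmetric InfoNCE loss (hypothetical helper, not the paper's code).

    Row i of image_emb is paired with row i of text_emb; every other row
    in the batch serves as a negative. The temperature value is an assumption.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature  # (batch, batch) scaled cosine similarities
    idx = np.arange(logits.shape[0])
    loss_image_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_text_to_image = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_image_to_text + loss_text_to_image)

# Toy data standing in for encoder outputs (8 image-report pairs, 32 dims).
rng = np.random.default_rng(0)
report_emb = rng.normal(size=(8, 32))
aligned_image_emb = report_emb + 0.05 * rng.normal(size=(8, 32))
random_image_emb = rng.normal(size=(8, 32))

loss_aligned = symmetric_info_nce(aligned_image_emb, report_emb)
loss_random = symmetric_info_nce(random_image_emb, report_emb)
print(f"aligned pairs: {loss_aligned:.4f}, random pairs: {loss_random:.4f}")
```

In this sketch, image embeddings that closely match their paired report embeddings yield a lower loss than unrelated ones. A local variant could in principle apply a similar loss to region and sentence embeddings within one image-report pair, but how LoVT aligns regions with sentences is not specified in the text above.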
Taxonomy
Methods: Contrastive Learning
