Enhancing Biomedical Multi-modal Representation Learning with Multi-scale Pre-training and Perturbed Report Discrimination
Xinliu Zhong, Kayhan Batmanghelich, Li Sun

TL;DR
This paper introduces a novel pre-training method for biomedical vision-language models that uses perturbed report discrimination to better capture complex domain-specific semantics, leading to improved downstream task performance.
Contribution
It proposes a new contrastive learning approach with report perturbation and multi-granularity contrast, enhancing semantic understanding in biomedical multi-modal representations.
Findings
Outperforms baseline methods on multiple downstream tasks.
Learns more semantic and robust multi-modal representations.
Improves sensitivity to higher-level semantic structures.
Abstract
Vision-language models pre-trained on large-scale unlabeled biomedical images and associated reports learn generalizable semantic representations. These multi-modal representations can benefit various downstream tasks in the biomedical domain. Contrastive learning is widely used to pre-train vision-language models on general natural images and associated captions. Despite its popularity, we found that biomedical texts have complex, domain-specific semantics that are often neglected by common contrastive methods. To address this issue, we propose a novel method, perturbed report discrimination, for pre-training biomedical vision-language models. First, we curate a set of text perturbation methods that keep the same words but disrupt the semantic structure of the sentence. Next, we apply different types of perturbation to reports, and use the model to distinguish the original report from…
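To make the idea of a word-preserving perturbation concrete, here is a minimal Python sketch. The abstract does not list the perturbation types the authors actually use, so the two shown here (a full word shuffle and a single adjacent-word swap) are illustrative assumptions, as are the function names `shuffle_words`, `swap_adjacent`, and `build_candidates`. The sketch only builds the discrimination candidates; it does not model the encoder or the training loss.

```python
import random


def shuffle_words(report: str, rng: random.Random) -> str:
    """Hypothetical perturbation: permute all words, keeping the same bag of words."""
    words = report.split()
    if len(set(words)) < 2:
        return report  # no distinct ordering exists
    shuffled = words[:]
    while shuffled == words:  # retry until the order actually changes
        rng.shuffle(shuffled)
    return " ".join(shuffled)


def swap_adjacent(report: str, rng: random.Random) -> str:
    """Hypothetical perturbation: swap one pair of neighbouring, non-identical words."""
    words = report.split()
    positions = [i for i in range(len(words) - 1) if words[i] != words[i + 1]]
    if not positions:
        return report
    i = rng.choice(positions)
    words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)


def build_candidates(report: str, rng: random.Random) -> tuple[list[str], int]:
    """Return the original mixed with perturbed copies, plus the original's index."""
    candidates = [report, shuffle_words(report, rng), swap_adjacent(report, rng)]
    order = list(range(len(candidates)))
    rng.shuffle(order)
    mixed = [candidates[i] for i in order]
    label = order.index(0)  # position where the unperturbed report ended up
    return mixed, label


if __name__ == "__main__":
    rng = random.Random(0)
    report = "no evidence of pleural effusion in the left lung"
    mixed, label = build_candidates(report, rng)
    for idx, text in enumerate(mixed):
        marker = "original" if idx == label else "perturbed"
        print(f"{marker}: {text}")
```

Every candidate contains exactly the same words, so a model that treats a report as an unordered bag of words cannot pick out the original; it has to use word order and sentence structure, which is the property the abstract says common contrastive methods tend to neglect.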
