Improving Medical Visual Representation Learning with Pathological-level Cross-Modal Alignment and Correlation Exploration

Jun Wang, Lixing Zhu, Xiaohan Yu, Abhir Bhalerao, and Yulan He

arXiv:2506.10573 · cs.CV · December 9, 2025

TL;DR

This paper introduces PLACE, a framework for medical visual representation learning built on pathological-level cross-modal alignment and correlation exploration. It improves performance across classification, retrieval, segmentation, and detection tasks without requiring extra human annotations.

Contribution

The paper proposes a new pathological-level cross-modal alignment method and a correlation exploration proxy task, enhancing medical visual representations without relying on external disease annotations.

Findings

01 Achieves state-of-the-art results on classification tasks

02 Improves image-to-text retrieval performance

03 Improves semantic segmentation and object detection accuracy

Abstract

Learning medical visual representations jointly from image-report pairs has attracted increasing research attention because it can alleviate the data scarcity problem in the medical domain. The primary challenges stem from lengthy reports that contain complex discourse relations and semantic pathologies. Previous works have predominantly focused on instance-wise or token-wise cross-modal alignment, often neglecting pathological-level consistency. This paper presents a novel framework, PLACE, that promotes Pathological-Level Alignment and enriches fine-grained details via Correlation Exploration, without additional human annotations. Specifically, we propose a novel pathological-level cross-modal alignment (PCMA) approach to maximize the consistency of pathology observations from both images and reports. To facilitate this, a Visual…
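To make "maximizing the consistency of pathology observations from both images and reports" concrete, here is a minimal sketch of one common way such an objective can be written: a symmetric, CLIP-style InfoNCE loss over pathology-level embeddings. This is not PLACE's actual PCMA loss; the abstract is truncated before it explains how pathology-level embeddings are obtained. The sketch assumes each image and its report have already been mapped to K pathology-level embeddings of dimension D, with slot k on both sides describing the same pathology. The function name `pathology_alignment_loss`, the (B, K, D) layout, and the temperature of 0.07 are all hypothetical choices made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale each vector to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def log_softmax(logits, axis=-1):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def pathology_alignment_loss(img_path, txt_path, temperature=0.07):
    """Hypothetical symmetric InfoNCE over pathology-level embeddings.

    img_path, txt_path: arrays of shape (B, K, D), holding K pathology-level
    embeddings per image and per report. Slot k is ASSUMED to refer to the
    same pathology in both modalities.
    """
    v = l2_normalize(img_path)
    t = l2_normalize(txt_path)
    # (B, K, K): cosine similarity of every visual slot with every textual slot.
    logits = np.einsum("bkd,bjd->bkj", v, t) / temperature
    idx = np.arange(logits.shape[1])
    # Image-to-text: visual slot k should select textual slot k among K candidates.
    loss_i2t = -log_softmax(logits, axis=2)[:, idx, idx].mean()
    # Text-to-image: textual slot k should select visual slot k.
    loss_t2i = -log_softmax(logits, axis=1)[:, idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)


# Usage: near-identical pathology embeddings give a much lower loss than
# embeddings whose pathology slots are shifted out of correspondence.
rng = np.random.default_rng(0)
img = rng.normal(size=(4, 5, 16))
aligned = pathology_alignment_loss(img, img + 0.01 * rng.normal(size=img.shape))
mismatched = pathology_alignment_loss(img, np.roll(img, 1, axis=1))
print(aligned < mismatched)  # True
```

Contrasting each pathology slot only against the other slots of the same image-report pair keeps the example small. A real objective might also draw negatives from other pairs in the batch.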

Taxonomy

Methods: Softmax · Attention Is All You Need