Anatomical Structure-Guided Medical Vision-Language Pre-training
Qingqiu Li, Xiaohan Yan, Jilan Xu, Runtian Yuan, Yuejie Zhang, Rui Feng, Quanli Shen, Xiaobo Zhang, Shujun Wang

TL;DR
This paper introduces an Anatomical Structure-Guided framework for medical vision-language pre-training that enhances interpretability and clinical relevance by leveraging anatomical parsing and fine-grained alignment.
Contribution
It proposes a novel anatomical structure-guided approach with report parsing, anatomical region-sentence alignment, and image-tag recognition to improve medical visual representation learning.
Findings
Outperforms state-of-the-art methods on five benchmarks.
Enhances local interpretability and semantic alignment.
Improves downstream task performance.
Abstract
Learning medical visual representations through vision-language pre-training has made remarkable progress. Despite the promising performance, it still faces two challenges: local alignment lacks interpretability and clinical relevance, and the internal and external representation learning of image-report pairs is insufficient. To address these issues, we propose an Anatomical Structure-Guided (ASG) framework. Specifically, we parse raw reports into triplets <anatomical region, finding, existence>, and fully utilize each element as supervision to enhance representation learning. For anatomical region, we design an automatic anatomical region-sentence alignment paradigm in collaboration with radiologists, considering them as the minimum semantic units to explore fine-grained local alignment. For finding and existence, we regard them as image tags, applying an image-tag recognition…
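To make the triplet idea concrete, here is a minimal sketch of how parsed <anatomical region, finding, existence> triplets could be turned into a multi-label target for image-tag recognition. It is an illustration, not the authors' implementation: the `Triplet` class, the `tags_to_multi_hot` helper, the example report, and the tag vocabulary are all hypothetical. Treating "existence = false" as a zero label is also an assumption, since the truncated abstract does not say how negated findings are handled.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Triplet:
    """One parsed report statement: <anatomical region, finding, existence>."""
    region: str
    finding: str
    exists: bool


def tags_to_multi_hot(triplets: Sequence[Triplet], vocabulary: Sequence[str]) -> List[int]:
    """Build a multi-label target: 1 if a finding is reported as present, else 0.

    Assumption: findings that are negated or not mentioned both map to 0.
    """
    present = {t.finding for t in triplets if t.exists}
    return [1 if tag in present else 0 for tag in vocabulary]


# Hypothetical report, already parsed into triplets.
example = [
    Triplet("left lung", "pleural effusion", True),
    Triplet("right lung", "pneumothorax", False),
    Triplet("cardiac silhouette", "cardiomegaly", True),
]

# Hypothetical tag vocabulary; the order fixes the target vector layout.
vocab = ["cardiomegaly", "pleural effusion", "pneumothorax"]

target = tags_to_multi_hot(example, vocab)
print(target)
```

In this sketch the `region` field is unused by the tag target; it would instead serve the region-sentence alignment branch, where each sentence is paired with the anatomical region it describes.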
Taxonomy
Topics: Medical and Biological Sciences
Methods: Contrastive Learning
