Cross-modal Prototype Driven Network for Radiology Report Generation
Jun Wang, Abhir Bhalerao, and Yulan He

TL;DR
This paper introduces XPRONET, a cross-modal prototype driven network that learns cross-modal patterns and uses them to improve radiology report generation. It outperforms recent methods on IU-Xray and achieves comparable results on MIMIC-CXR.
Contribution
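The paper proposes a cross-modal prototype driven network built from three modules: a shared cross-modal prototype matrix, a cross-modal prototype network that embeds cross-modal information into the visual and textual features, and an improved multi-label contrastive loss.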
Findings
Outperforms recent methods on the IU-Xray benchmark
Achieves comparable results on MIMIC-CXR
Enhances multi-label prototype learning with an improved multi-label contrastive loss
Abstract
Radiology report generation (RRG) aims to describe automatically a radiology image with human-like language and could potentially support the work of radiologists, reducing the burden of manual reporting. Previous approaches often adopt an encoder-decoder architecture and focus on single-modal feature learning, while few studies explore cross-modal feature interaction. Here we propose a Cross-modal PROtotype driven NETwork (XPRONET) to promote cross-modal pattern learning and exploit it to improve the task of radiology report generation. This is achieved by three well-designed, fully differentiable and complementary modules: a shared cross-modal prototype matrix to record the cross-modal prototypes; a cross-modal prototype network to learn the cross-modal prototypes and embed the cross-modal information into the visual and textual features; and an improved multi-label contrastive loss…
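The idea of a prototype matrix shared by both modalities can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names `query_prototypes` and `fuse`, the cosine-similarity lookup, the `top_k` selection, and the additive fusion are all assumptions made for illustration, and the abstract does not specify how XPRONET performs these steps.

```python
import numpy as np

def query_prototypes(features, prototypes, top_k=2, eps=1e-8):
    """For each feature row, mix its top_k most similar prototypes.

    Hypothetical helper: similarity measure and weighting are assumptions.
    features:   (N, D) visual or textual features
    prototypes: (K, D) prototype matrix shared by both modalities
    """
    # Cosine similarity between every feature and every prototype.
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + eps)
    p = prototypes / (np.linalg.norm(prototypes, axis=1, keepdims=True) + eps)
    sim = f @ p.T                                   # (N, K)

    # Indices and similarities of the top_k prototypes per feature.
    idx = np.argsort(-sim, axis=1)[:, :top_k]       # (N, top_k)
    top_sim = np.take_along_axis(sim, idx, axis=1)  # (N, top_k)

    # Softmax over the selected similarities gives mixing weights.
    w = np.exp(top_sim - top_sim.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)

    # Weighted sum of the selected prototypes.
    responses = (w[..., None] * prototypes[idx]).sum(axis=1)  # (N, D)
    return responses, idx, w

def fuse(features, responses):
    """Assumed fusion step: add the prototype response to the features."""
    return features + responses

rng = np.random.default_rng(0)
prototypes = rng.normal(size=(5, 4))   # one matrix, used by both modalities
visual = rng.normal(size=(3, 4))       # e.g. image patch features
textual = rng.normal(size=(2, 4))      # e.g. report token features

vis_resp, vis_idx, vis_w = query_prototypes(visual, prototypes)
txt_resp, txt_idx, txt_w = query_prototypes(textual, prototypes)
visual_out = fuse(visual, vis_resp)
textual_out = fuse(textual, txt_resp)
```

Because both `visual` and `textual` query the same `prototypes` array, any update to that array during training would affect both modalities, which is the sense in which the matrix is shared in this sketch.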
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Topic Modeling
