Learnable Retrieval Enhanced Visual-Text Alignment and Fusion for Radiology Report Generation

Qin Zhou; Guoyan Liang; Xindi Li; Jingyuan Chen; Wang Zhe; Chang Yao; Sai Wu

arXiv:2507.07568 · stat.ME · July 11, 2025

Open Access

TL;DR

REVTAF introduces a novel framework that enhances radiology report generation by combining learnable retrieval, fine-grained visual-text alignment, and dynamic fusion, effectively addressing class imbalance and improving cross-modal integration.

Contribution

The paper proposes REVTAF, a new framework integrating adaptive retrieval and optimal transport-based alignment for improved report generation under weak supervision.

Findings

01. Outperforms state-of-the-art methods with a 7.4% improvement on MIMIC-CXR

02. Achieves 2.9% higher accuracy on the IU X-Ray dataset

03. Demonstrates superiority over mainstream multimodal LLMs in radiology report tasks

Abstract

Automated radiology report generation is essential for improving diagnostic efficiency and reducing the workload of medical professionals. However, existing methods face significant challenges, such as disease class imbalance and insufficient cross-modal fusion. To address these issues, we propose the learnable Retrieval Enhanced Visual-Text Alignment and Fusion (REVTAF) framework, which tackles both class imbalance and visual-text fusion in report generation. REVTAF incorporates two core components: (1) a Learnable Retrieval Enhancer (LRE) that uses semantic hierarchies from hyperbolic space and intra-batch context, combined through a ranking-based metric, to adaptively retrieve the most relevant reference reports and enhance image representations, particularly for underrepresented (tail) class inputs; and (2) a fine-grained visual-text alignment and fusion strategy that ensures…
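The abstract mentions retrieval guided by hyperbolic space. As a rough illustration only, the sketch below ranks reference embeddings by the standard Poincaré-ball distance and returns the closest ones. It is not the authors' LRE: the retrieval here is fixed rather than learnable, it ignores the intra-batch context and ranking-based metric, and the function names and 2-D embeddings are hypothetical.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincare ball."""
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must have Euclidean norm < 1")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # d(u, v) = arcosh(1 + 2 * ||u - v||^2 / ((1 - ||u||^2) * (1 - ||v||^2)))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    return math.acosh(arg)


def retrieve_top_k(query, references, k=2):
    """Return indices of the k reference embeddings closest to the query."""
    ranked = sorted(
        range(len(references)),
        key=lambda i: poincare_distance(query, references[i]),
    )
    return ranked[:k]


# Hypothetical embeddings: one image and four candidate reference reports.
image_embedding = [0.10, 0.20]
report_embeddings = [[0.12, 0.18], [-0.70, 0.10], [0.30, 0.40], [0.00, -0.90]]

top = retrieve_top_k(image_embedding, report_embeddings, k=2)
print(top)
```

Points near the boundary of the ball sit far from everything else under this distance, which is one reason hyperbolic embeddings are often used to encode hierarchies; how REVTAF exploits this is described in the full paper, not here.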

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning