RIHA: Report-Image Hierarchical Alignment for Radiology Report Generation
Yucheng Chen, Yang Yu, Yufei Shi, Conghao Xiong, Xulei Yang, and Si Yong Yeo

TL;DR
RIHA introduces a hierarchical alignment transformer for radiology report generation that aligns images and reports at the paragraph, sentence, and word levels to improve report accuracy.
Contribution
The paper proposes a multi-level alignment framework that pairs visual and textual pyramids with a hierarchical alignment module based on optimal transport.
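As a rough illustration of what optimal-transport alignment between two feature sets can look like, the sketch below computes an entropy-regularized transport plan with Sinkhorn iterations. It is not the paper's implementation: this summary does not give the cost function, marginals, or solver that RIHA uses, so the cosine cost, uniform marginals, function name `sinkhorn_alignment`, and parameters `epsilon` and `n_iters` are all assumptions made for the example.

```python
import numpy as np

def sinkhorn_alignment(visual, textual, epsilon=0.5, n_iters=500):
    """Entropic OT plan between two feature sets (hypothetical helper).

    visual:  (n, d) array, e.g. image-region features at one pyramid level
    textual: (m, d) array, e.g. word, sentence, or paragraph features
    Assumes uniform marginals and a cosine-distance cost (both assumptions).
    """
    # L2-normalize rows so the dot product is a cosine similarity.
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    t = textual / np.linalg.norm(textual, axis=1, keepdims=True)
    cost = 1.0 - v @ t.T                      # cosine distance in [0, 2]

    n, m = cost.shape
    a = np.full(n, 1.0 / n)                   # uniform mass over visual items
    b = np.full(m, 1.0 / m)                   # uniform mass over textual items

    kernel = np.exp(-cost / epsilon)          # Gibbs kernel
    u = np.ones(n)
    for _ in range(n_iters):
        w = b / (kernel.T @ u)                # rescale to match column marginals
        u = a / (kernel @ w)                  # rescale to match row marginals

    plan = u[:, None] * kernel * w[None, :]   # soft alignment matrix, shape (n, m)
    ot_cost = float((plan * cost).sum())      # transport cost, usable as a loss term
    return plan, ot_cost

# Toy usage with random features standing in for one level of each pyramid.
rng = np.random.default_rng(0)
regions = rng.normal(size=(6, 8))
sentences = rng.normal(size=(4, 8))
plan, ot_cost = sinkhorn_alignment(regions, sentences)
```

Each entry `plan[i, j]` can be read as how much of visual item `i` is matched to textual item `j`. In a hierarchical setting, one plausible design is to run the same routine once per level (paragraph, sentence, word) and combine the resulting costs, though whether RIHA does this is not stated here.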
Findings
Outperforms state-of-the-art models on IU-Xray and MIMIC-CXR datasets.
Enhances report accuracy by capturing semantic hierarchies and spatial relationships.
Improves natural language generation quality and clinical relevance.
Abstract
Radiology report generation (RRG) has emerged as a promising approach to alleviate radiologists' workload and reduce human errors by automatically generating diagnostic reports from medical images. A key challenge in RRG is achieving fine-grained alignment between complex visual features and the hierarchical structure of long-form radiology reports. Although recent methods have improved image-text representation learning, they often treat reports as flat sequences, overlooking their structured sections and semantic hierarchies. This simplification hinders precise cross-modal alignment and weakens RRG accuracy. To address this challenge, we propose RIHA (Report-Image Hierarchical Alignment Transformer), a novel end-to-end framework that performs multi-level alignment between radiological images and their corresponding reports across paragraph, sentence, and word levels. This hierarchical…
