S2D-ALIGN: Shallow-to-Deep Auxiliary Learning for Anatomically-Grounded Radiology Report Generation
Jiechao Gao, Chang Liu, Yuangang Li

TL;DR
This paper introduces S2D-Align, a novel training paradigm for radiology report generation that progressively grounds reports in anatomical details using auxiliary signals, leading to improved alignment and report quality.
Contribution
S2D-Align employs a shallow-to-deep auxiliary learning strategy with a memory-based adapter to enhance anatomically-grounded alignment in radiology report generation.
Findings
Achieves state-of-the-art results on MIMIC-CXR and IU X-Ray datasets.
Demonstrates the effectiveness of multi-stage, auxiliary-guided alignment.
Validates the approach through comprehensive ablation studies.
Abstract
Radiology Report Generation (RRG) aims to automatically generate diagnostic reports from radiology images. To achieve this, existing methods have leveraged the powerful cross-modal generation capabilities of Multimodal Large Language Models (MLLMs), primarily focusing on optimizing cross-modal alignment between radiographs and reports through Supervised Fine-Tuning (SFT). However, by only performing instance-level alignment with the image-text pairs, the standard SFT paradigm fails to establish anatomically-grounded alignment, where the templated nature of reports often leads to sub-optimal generation quality. To address this, we propose S2D-Align, a novel SFT paradigm that establishes anatomically-grounded alignment by leveraging auxiliary signals of varying granularities. S2D-Align implements a shallow-to-deep strategy, progressively enriching the alignment process:…
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
