Libra: Leveraging Temporal Images for Biomedical Radiology Analysis
Xi Zhang, Zaiqiao Meng, Jake Lever, Edmond S. L. Ho

TL;DR
Libra is a novel multimodal large language model designed for chest X-ray report generation that effectively captures temporal differences between current and prior images, improving report accuracy.
Contribution
Libra introduces a temporal-aware architecture with a specialized Temporal Alignment Connector for enhanced medical image analysis.
Findings
Achieves state-of-the-art performance on the MIMIC-CXR dataset.
Effectively captures temporal differences in chest X-ray images.
Improves clinical relevance and lexical accuracy in reports.
Abstract
Radiology report generation (RRG) requires advanced medical image analysis, effective temporal reasoning, and accurate text generation. While multimodal large language models (MLLMs) align with pre-trained vision encoders to enhance visual-language understanding, most existing methods rely on single-image analysis or rule-based heuristics to process multiple images, failing to fully leverage temporal information in multi-modal medical datasets. In this paper, we introduce Libra, a temporal-aware MLLM tailored for chest X-ray report generation. Libra combines a radiology-specific image encoder with a novel Temporal Alignment Connector (TAC), designed to accurately capture and integrate temporal differences between paired current and prior images. Extensive experiments on the MIMIC-CXR dataset demonstrate that Libra establishes a new state-of-the-art benchmark among similarly scaled…
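The abstract says the Temporal Alignment Connector integrates differences between paired current and prior images, but it does not describe the connector's internals. As a rough illustration of the general idea only, the sketch below fuses two sets of patch features with plain scaled dot-product cross-attention plus a residual connection. This is a generic technique swapped in for illustration, not Libra's actual TAC; the names `cross_attend` and `fuse_study`, and the fallback of reusing the current features when no prior study exists, are hypothetical assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(current, prior):
    """Let each current-image patch attend over all prior-image patches.

    current, prior: lists of patch feature vectors (lists of floats)
    sharing the same feature dimension.
    Returns one fused vector per current patch:
    current patch + attention-weighted average of prior patches.
    """
    dim = len(current[0])
    scale = math.sqrt(dim)
    fused = []
    for q in current:
        # Similarity of this current patch to every prior patch.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in prior]
        weights = softmax(scores)
        # Weighted average of prior patches (the "context" vector).
        context = [
            sum(w * k[i] for w, k in zip(weights, prior)) for i in range(dim)
        ]
        # Residual connection keeps the current-image signal intact.
        fused.append([qi + ci for qi, ci in zip(q, context)])
    return fused


def fuse_study(current, prior=None):
    """Hypothetical wrapper: with no prior study, reuse the current
    features as a stand-in (an assumption made for this sketch)."""
    if prior is None:
        prior = current
    return cross_attend(current, prior)
```

Calling `fuse_study(current_feats, prior_feats)` returns one fused vector per current-image patch, so the output could be passed on wherever single-image features would otherwise go. A real connector would use learned projections and operate on encoder outputs rather than raw lists.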
Peer Reviews
Decision·Submitted to ICLR 2025
1- The clinical problem statement is fair and important.
2- The evaluation is good and comprehensive, and the ablation studies show how the method behaves in different scenarios.
3- The authors introduce an interesting technical mechanism for local and global learning.
1- The main claim of the paper is that it is the first to introduce a VLM for automatic report generation that utilizes temporal scans to ensure more realistic reports that learn from multiple scans acquired at different time points. Regardless of whether it is a VLM or other types of encoder/decoder nets, this claim is not true because multiple works have been published to address this problem. For instance, -https://aclanthology.org/2023.findings-emnlp.325/ -https://aclanthology.org/2023.fin
* Innovative temporal processing. The TAC module is a novel addition that allows Libra to capture and utilize temporal changes in medical images effectively, enhancing the model's clinical applicability.
* Comprehensive ablation studies. The ablation experiments clarify the importance of each submodule (TFM, LFE, and PIPB), reinforcing the credibility of the design choices.
* Comprehensive appendix. The appendix is highly commendable, providing detailed descriptions of the datasets, training con
Despite the paper's clarity, several imprecise arguments and overstatements necessitate revision and clarification:
* Incomplete framework representation. TFM is a crucial component of the core TAC, but the framework diagram omits the illustration of the $MLP_{final}$ part within TFM. This omission may lead to ambiguity regarding the final processing steps, making it more difficult for readers to fully understand how all modules are integrated within the model. It is recommended that the authors
1. The paper presents an innovative approach for handling prior study citations across various time points in report generation tasks.
2. The development of the Temporal Alignment Connector showcases a sophisticated method for capturing and integrating temporal information across multiple images.
3. A comprehensive experimental analysis, including ablation studies and qualitative comparisons, is provided to validate the effectiveness of the proposed methods.
1. Comparative Results: The comparative results do not convincingly demonstrate Libra's superiority. Although MIMIC-Diff-VQA is derived from MIMIC-CXR, the comparison seems unbalanced, as Libra was trained on both MIMIC-CXR and MIMIC-Diff-VQA, while the other model was only trained on MIMIC-CXR.
2. Effectiveness of the Temporal Alignment Connector: The authors overstate the effectiveness of the Temporal Alignment Connector (e.g., "significant enhancements across all metrics" in line 398). While
Taxonomy
Topics: AI in cancer detection
Methods: Focus
