TL;DR
This paper presents a specialized visual language model that effectively generates detailed radiology reports from chest X-ray images by aligning vision encoders with a fine-tuned large language model.
Contribution
It introduces a novel radiology-focused visual language model that combines vision encoders with a fine-tuned LLM for accurate report generation from chest X-rays.
Findings
Model effectively generates radiology reports from chest X-rays.
Two-stage training improves alignment and report accuracy.
Demonstrates potential of multimodal LLMs in medical imaging.
Abstract
We introduce a radiology-focused visual language model designed to generate radiology reports from chest X-rays. Building on previous findings that large language models (LLMs) can acquire multimodal capabilities when aligned with pretrained vision encoders, we demonstrate similar potential with chest X-ray images. This integration enhances the model's ability to understand and describe chest X-ray images. Our model combines an image encoder with a fine-tuned LLM based on the Vicuna-7B architecture, enabling it to generate different sections of a radiology report with notable accuracy. Training proceeds in two stages: (i) initial alignment of chest X-ray features with the LLM, followed by (ii) fine-tuning for radiology report generation.
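The two-stage recipe described in the abstract can be pictured with a small, framework-free sketch. It is illustrative only and is not the authors' implementation: the abstract names an image encoder and a Vicuna-7B-based LLM but does not say how they are connected or which weights are updated in each stage. The `projector` component, the per-stage trainable sets in `STAGES`, and all function names below are assumptions, borrowed from how similar vision-language models are commonly trained.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    name: str
    trainable: bool = False


@dataclass
class ReportModel:
    # Hypothetical layout: the "projector" bridging image features and the
    # LLM embedding space is an assumption, not stated in the abstract.
    image_encoder: Component = field(default_factory=lambda: Component("image_encoder"))
    projector: Component = field(default_factory=lambda: Component("projector"))
    llm: Component = field(default_factory=lambda: Component("llm"))

    def components(self):
        return [self.image_encoder, self.projector, self.llm]


# Assumed schedule: stage (i) trains only the bridge, stage (ii) also
# updates the LLM. The paper may freeze or unfreeze different parts.
STAGES = {
    "align": {"projector"},
    "finetune": {"projector", "llm"},
}


def configure_stage(model, stage):
    """Mark components trainable for the given stage; return their names."""
    trainable = STAGES[stage]
    for component in model.components():
        component.trainable = component.name in trainable
    return [c.name for c in model.components() if c.trainable]


def project(patch_features, weight):
    """Linearly map each patch feature (length d_vis) to d_llm values.

    `weight` is a list of d_llm rows, each of length d_vis.
    """
    return [
        [sum(f * w for f, w in zip(patch, row)) for row in weight]
        for patch in patch_features
    ]


def build_llm_input(image_tokens, text_embeddings):
    """Prepend projected image tokens to the text embeddings."""
    return image_tokens + text_embeddings
```

A call such as `configure_stage(ReportModel(), "align")` returns the names of the components that would receive gradient updates in that stage. In a real system the same switch would typically be made by toggling gradient tracking on the corresponding parameter groups.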
