Vision-Language Models for Automated Chest X-ray Interpretation: Leveraging ViT and GPT-2
Md. Rakibul Islam, Md. Zahid Hossain, Mustofa Ahmed, Most. Sharmin Sultana Samu

TL;DR
This paper evaluates multimodal AI models combining vision transformers and language models to automate and improve the accuracy of radiology report generation from chest X-ray images.
Contribution
It compares different combinations of vision transformers and language models, identifying the most effective model for automated radiology report generation.
Findings
SWIN Transformer-BART outperforms other models in report quality metrics
Multimodal models significantly reduce report generation time
The study demonstrates the potential for AI to assist radiologists
Abstract
Radiology plays a pivotal role in modern medicine due to its non-invasive diagnostic capabilities. However, the manual generation of unstructured medical reports is time-consuming and error-prone, creating a significant bottleneck in clinical workflows. Despite advances in AI-generated radiology reports, detailed and accurate report generation remains a challenge. In this study, we evaluated different combinations of multimodal models that integrate Computer Vision and Natural Language Processing to generate comprehensive radiology reports. We employed a pretrained Vision Transformer (ViT-B16) and a SWIN Transformer as the image encoders, with BART and GPT-2 serving as the text decoders. We used chest X-ray images and reports from the IU-Xray dataset to evaluate the usability of the SWIN Transformer-BART, SWIN Transformer-GPT-2, ViT-B16-BART and…
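The encoder-decoder design space described in the abstract can be sketched as a small cross product. This is a minimal illustration, not the authors' code: the `Pairing` class and `candidate_pairings` function are hypothetical names, and the sketch assumes every image encoder may be paired with every text decoder. The abstract as shown here is truncated, so whether the study evaluated all of these pairings is not confirmed by this page.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Pairing:
    """One image-encoder / text-decoder combination (hypothetical helper)."""
    encoder: str  # vision model that embeds the chest X-ray
    decoder: str  # language model that generates the report text

    @property
    def name(self) -> str:
        # Mirrors the "Encoder-Decoder" naming style used in the abstract.
        return f"{self.encoder}-{self.decoder}"


# Encoder and decoder names are taken from the abstract.
ENCODERS = ("SWIN Transformer", "ViT-B16")
DECODERS = ("BART", "GPT-2")


def candidate_pairings(encoders=ENCODERS, decoders=DECODERS):
    """Return every encoder/decoder pairing (assumption: full cross product)."""
    return [Pairing(e, d) for e, d in product(encoders, decoders)]


if __name__ == "__main__":
    for pairing in candidate_pairings():
        print(pairing.name)
```

Keeping the pairing as a small immutable record makes it easy to attach per-model evaluation scores later, for example in a dictionary keyed by `Pairing`.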
Taxonomy
Topics: Topic Modeling · COVID-19 diagnosis using AI · Machine Learning in Healthcare
Methods: Attention Is All You Need · Cosine Annealing · Adam · Softmax · Absolute Position Encodings · Dropout · Byte Pair Encoding · Attention Dropout · Linear Layer
