Vision-Language Models for Automated Chest X-ray Interpretation: Leveraging ViT and GPT-2

Md. Rakibul Islam; Md. Zahid Hossain; Mustofa Ahmed; Most. Sharmin; Sultana Samu

arXiv:2501.12356·cs.CV·January 22, 2025

Open Access

TL;DR

This paper evaluates multimodal AI models combining vision transformers and language models to automate and improve the accuracy of radiology report generation from chest X-ray images.

Contribution

It compares different combinations of vision transformers and language models, identifying the most effective model for automated radiology report generation.

Findings

01. SWIN Transformer-BART outperforms other models in report quality metrics

02. Multimodal models significantly reduce report generation time

03. The study demonstrates the potential for AI to assist radiologists

Abstract

Radiology plays a pivotal role in modern medicine due to its non-invasive diagnostic capabilities. However, the manual generation of unstructured medical reports is time-consuming and error-prone, which creates a significant bottleneck in clinical workflows. Despite advancements in AI-generated radiology reports, challenges remain in achieving detailed and accurate report generation. In this study, we evaluated different combinations of multimodal models that integrate Computer Vision and Natural Language Processing to generate comprehensive radiology reports. We employed a pretrained Vision Transformer (ViT-B16) and a SWIN Transformer as the image encoders. The BART and GPT-2 models serve as the textual decoders. We used chest X-ray images and reports from the IU-Xray dataset to evaluate the usability of the SWIN Transformer-BART, SWIN Transformer-GPT-2, ViT-B16-BART and…
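The abstract names two image encoders and two text decoders. As a minimal sketch of the comparison space this implies, the snippet below enumerates every encoder-decoder pairing. The constant names and the `pairings` helper are hypothetical and for illustration only, and because the abstract is truncated, treating the full cross product as the evaluated set is an assumption rather than something the text above confirms.

```python
from itertools import product

# Encoder and decoder names as they appear in the abstract.
ENCODERS = ["SWIN Transformer", "ViT-B16"]
DECODERS = ["BART", "GPT-2"]


def pairings(encoders, decoders):
    """Return every encoder-decoder pairing as an 'Encoder-Decoder' label.

    Hypothetical helper: it only builds labels and does not load or
    train any model.
    """
    return [f"{enc}-{dec}" for enc, dec in product(encoders, decoders)]


if __name__ == "__main__":
    # Print one label per candidate combination.
    for label in pairings(ENCODERS, DECODERS):
        print(label)
```

Each label identifies one image-encoder and text-decoder pair, which is the unit the study compares on report quality.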

Taxonomy

Topics: Topic Modeling · COVID-19 diagnosis using AI · Machine Learning in Healthcare

Methods: Attention Is All You Need · Cosine Annealing · Adam · Softmax · Absolute Position Encodings · Dropout · Byte Pair Encoding · Attention Dropout · Linear Layer