SERPENT-VLM: Self-Refining Radiology Report Generation Using Vision Language Models

Manav Nitin Kapadnis; Sohan Patnaik; Abhilash Nandy; Sourjyadip Ray; Pawan Goyal; Debdoot Sheet

arXiv:2404.17912·cs.CL·July 19, 2024·2 cites

TL;DR

SERPENT-VLM adds a self-refining mechanism to vision-language models for radiology report generation, reducing hallucinated details and achieving state-of-the-art results on the IU X-ray and ROCO datasets.

Contribution

The paper presents a novel self-supervised loss and a dynamic interaction framework that improve image-text alignment in radiology report generation models.

Findings

01. Outperforms existing models such as LLaVA-Med and BiomedGPT.

02. Achieves state-of-the-art performance on the IU X-ray and ROCO datasets.

03. Demonstrates robustness against noisy images.

Abstract

Radiology Report Generation (R2Gen) demonstrates how Multi-modal Large Language Models (MLLMs) can automate the creation of accurate and coherent radiological reports. Existing methods often hallucinate details in text-based reports that don't accurately reflect the image content. To mitigate this, we introduce a novel strategy, SERPENT-VLM (SElf Refining Radiology RePort GENeraTion using Vision Language Models), which improves the R2Gen task by integrating a self-refining mechanism into the MLLM framework. We employ a unique self-supervised loss that leverages similarity between pooled image representations and the contextual representations of the generated radiological text, alongside the standard Causal Language Modeling objective, to refine image-text representations. This allows the model to scrutinize and align the generated text through dynamic interaction between a given image…
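
The combined objective described in the abstract can be illustrated with a minimal sketch. It assumes mean pooling for both modalities, cosine similarity as the similarity measure, and a simple weighted sum with the causal language modeling loss; the abstract does not specify any of these choices, so the function names, the `weight` parameter, and the `1 - cosine` form are hypothetical and not the authors' implementation.

```python
import math

def mean_pool(vectors):
    """Average a list of equal-length vectors into one vector."""
    count = len(vectors)
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / count for i in range(dim)]

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def causal_lm_loss(token_probs):
    """Mean negative log-likelihood of the reference report tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def combined_loss(image_patches, text_states, token_probs, weight=1.0):
    """Causal LM loss plus a weighted image-text dissimilarity term.

    image_patches: per-patch image representations (assumed layout).
    text_states:   contextual representations of the generated text.
    token_probs:   probability assigned to each reference token.
    weight:        hypothetical trade-off coefficient.
    """
    image_vec = mean_pool(image_patches)
    text_vec = mean_pool(text_states)
    # Assumed form: the term is 0 when the pooled vectors point the same way.
    similarity_loss = 1.0 - cosine_similarity(image_vec, text_vec)
    return causal_lm_loss(token_probs) + weight * similarity_loss
```

Under these assumptions, the similarity term vanishes when the pooled image and text vectors are aligned, so the objective reduces to the language modeling loss; misaligned representations add a penalty scaled by `weight`.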

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Biomedical Text Mining and Ontologies

Methods: ALIGN