SERPENT-VLM : Self-Refining Radiology Report Generation Using Vision Language Models
Manav Nitin Kapadnis, Sohan Patnaik, Abhilash Nandy, Sourjyadip Ray, Pawan Goyal, Debdoot Sheet

TL;DR
SERPENT-VLM introduces a self-refining mechanism in vision-language models to improve radiology report generation, significantly reducing hallucinations and achieving state-of-the-art results on key datasets.
Contribution
It presents a novel self-supervised loss and dynamic interaction framework that enhances image-text alignment in radiology report generation models.
Findings
Outperforms existing models like LLaVA-Med and BiomedGPT.
Achieves state-of-the-art performance on IU X-ray and ROCO datasets.
Demonstrates robustness against noisy images.
Abstract
Radiology Report Generation (R2Gen) demonstrates how Multi-modal Large Language Models (MLLMs) can automate the creation of accurate and coherent radiological reports. Existing methods often hallucinate details in text-based reports that don't accurately reflect the image content. To mitigate this, we introduce a novel strategy, SERPENT-VLM (SElf Refining Radiology RePort GENeraTion using Vision Language Models), which improves the R2Gen task by integrating a self-refining mechanism into the MLLM framework. We employ a unique self-supervised loss that leverages similarity between pooled image representations and the contextual representations of the generated radiological text, alongside the standard Causal Language Modeling objective, to refine image-text representations. This allows the model to scrutinize and align the generated text through dynamic interaction between a given image…
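The combined objective described in the abstract can be illustrated with a minimal sketch. The abstract states only that the self-supervised loss uses the similarity between pooled image representations and contextual representations of the generated text, alongside the Causal Language Modeling objective. Everything else here is an assumption made for illustration: mean pooling, cosine similarity, the "1 - similarity" form of the alignment term, the weight parameter, and all function names are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def mean_pool(vectors: Sequence[Vector]) -> List[float]:
    """Average equal-length vectors into one vector (assumed pooling choice)."""
    count = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / count for i in range(dim)]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def causal_lm_loss(token_probs: Sequence[float]) -> float:
    """Mean negative log-likelihood assigned to the reference tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def combined_loss(
    image_patches: Sequence[Vector],
    text_states: Sequence[Vector],
    token_probs: Sequence[float],
    weight: float = 0.5,  # hypothetical weighting, not from the paper
) -> float:
    """Language-modeling loss plus an assumed image-text alignment penalty."""
    image_vec = mean_pool(image_patches)
    text_vec = mean_pool(text_states)
    # Assumed form: the penalty shrinks as the pooled vectors align.
    alignment_penalty = 1.0 - cosine_similarity(image_vec, text_vec)
    return causal_lm_loss(token_probs) + weight * alignment_penalty


if __name__ == "__main__":
    patches = [[1.0, 0.0], [1.0, 0.0]]      # toy image patch embeddings
    states = [[0.0, 1.0], [0.0, 1.0]]       # toy generated-text hidden states
    print(combined_loss(patches, states, token_probs=[0.5, 0.5]))
```

In this toy setup, text states that point in the same direction as the pooled image vector add no penalty, while orthogonal ones add the full weight, so the second term pushes generated text toward the image content.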
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Biomedical Text Mining and Ontologies
Methods: ALIGN
