Visual Alignment of Medical Vision-Language Models for Grounded Radiology Report Generation

Sarosij Bose; Ravi K. Rajendran; Biplob Debnath; Konstantinos Karydis; Amit K. Roy-Chowdhury; Srimat Chakradhar

arXiv:2512.16201·cs.CV·March 16, 2026

Open Access

TL;DR

This paper introduces VALOR, a novel method for grounded radiology report generation that improves visual alignment and clinical accuracy by combining clinical reasoning and self-supervised visual reasoning, without extra annotations.

Contribution

VALOR presents a new two-stage reasoning approach that enhances visual grounding and clinical accuracy in radiology report generation without relying on large labeled datasets or retrieval systems.

Findings

01. Significant improvements over state-of-the-art benchmarks.

02. Enhanced clinical accuracy and visual grounding.

03. Effective reduction of hallucinations in report generation.

Abstract

Radiology Report Generation (RRG) is a critical step toward automating healthcare workflows, facilitating accurate patient assessments, and reducing the workload of medical professionals. Despite recent progress in Large Medical Vision-Language Models (Med-VLMs), generating radiology reports that are both visually grounded and clinically accurate remains a significant challenge. Existing approaches often rely on large labeled corpora for pre-training, costly task-specific preference data, or retrieval-based knowledge. However, these strategies do not adequately mitigate hallucinations arising from poor cross-modal alignment between visual and linguistic representations. To address these limitations, we propose VALOR: Visual Alignment of Medical Vision-Language Models for GrOunded Radiology Report Generation, which tackles visual hallucinations through two complementary reasoning stages:…

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning