DREAM: Dynamic Retinal Enhancement with Adaptive Multi-modal Fusion for Expert Precision Medical Report Generation

Nagur Shareef Shaik; Teja Krishna Cherukuri; Dong Hye Ye

arXiv:2604.17209·cs.CV·April 21, 2026

TL;DR

DREAM is a framework for generating medical reports from retinal images. It combines visual features with ophthalmologist-curated clinical keywords through a two-stage adaptive fusion process, targets settings where training data is limited, and reports state-of-the-art results.

Contribution

The paper introduces DREAM, a two-stage adaptive multi-modal fusion framework that integrates visual features with clinical keywords to improve retinal image report generation when training data is limited.

Findings

01

Achieves a BLEU-4 score of 0.241 on the DeepEyeNet benchmark.

02

Demonstrates strong generalization to the ROCO dataset.

03

Sets a new state of the art in medical report generation for retinal images.
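
For readers unfamiliar with the metric behind the first finding, here is a minimal sketch of sentence-level BLEU-4: the geometric mean of clipped 1- to 4-gram precisions, multiplied by a brevity penalty. It assumes a single reference, whitespace tokenization, and no smoothing. The paper's evaluation may differ on all three (for example, corpus-level aggregation), so this illustrates the metric rather than reproducing the reported score. The example sentences are made up.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu4(candidate, reference):
    """Sentence-level BLEU-4 against one reference, without smoothing."""
    precisions = []
    for n in range(1, 5):
        cand = ngrams(candidate, n)
        ref = ngrams(reference, n)
        total = sum(cand.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or clipped == 0:
            return 0.0  # unsmoothed BLEU is zero if any precision is zero
        precisions.append(clipped / total)
    # Brevity penalty: penalize candidates shorter than the reference.
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1.0 - r / c)
    return bp * math.exp(sum(math.log(p) for p in precisions) / 4.0)


# Made-up example sentences, not taken from DeepEyeNet.
reference = "the retina shows mild macular edema".split()
candidate = "the retina shows mild edema".split()
score = bleu4(candidate, reference)
```

Because the candidate drops one word, both the higher-order precisions and the brevity penalty pull the score below 1.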

Abstract

Automating medical reports for retinal images requires a sophisticated blend of visual pattern recognition and deep clinical knowledge. Current Large Vision-Language Models (LVLMs) often struggle in specialized medical fields where data is scarce, leading to models that overfit and miss subtle but critical pathologies. To address this, we introduce DREAM (Dynamic Retinal Enhancement with Adaptive Multi-modal Fusion), a novel framework for high-fidelity medical report generation that excels even with limited data. DREAM employs a unique two-stage fusion mechanism that intelligently integrates visual data with clinical keywords curated by ophthalmologists. First, the Abstractor module maps image and keyword features into a shared space, enhancing visual data with pathology-relevant insights. Next, the Adaptor performs adaptive multi-modal fusion, dynamically weighting the importance of…
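
The abstract describes two stages: projecting image and keyword features into a shared space, then fusing them with dynamically computed weights. The sketch below shows one common way to realize that idea, a sigmoid-gated convex combination. It is not the authors' Abstractor or Adaptor, whose internals the truncated abstract does not give. All names (`gated_fusion`, `proj_image`, `proj_keyword`, `gate`), the dimensions, and the random inputs are hypothetical.

```python
import math
import random


def linear(x, weights, bias):
    """Compute y = W x + b, with W stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_linear(rng, out_dim, in_dim):
    """Small random weights and zero bias (untrained, illustration only)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def gated_fusion(image_feat, keyword_feat, params):
    # Stage 1 (assumed stand-in for the Abstractor): project both
    # modalities into a shared space of equal dimension.
    v = linear(image_feat, *params["proj_image"])
    k = linear(keyword_feat, *params["proj_keyword"])
    # Stage 2 (assumed stand-in for the Adaptor): compute a per-dimension
    # gate from the concatenated features, then mix the two modalities.
    gate = [sigmoid(z) for z in linear(v + k, *params["gate"])]
    fused = [g * vi + (1.0 - g) * ki for g, vi, ki in zip(gate, v, k)]
    return fused, gate


rng = random.Random(0)
shared_dim = 4
params = {
    "proj_image": init_linear(rng, shared_dim, 8),
    "proj_keyword": init_linear(rng, shared_dim, 6),
    "gate": init_linear(rng, shared_dim, 2 * shared_dim),
}
image_feat = [rng.random() for _ in range(8)]    # placeholder image features
keyword_feat = [rng.random() for _ in range(6)]  # placeholder keyword features
fused, gate = gated_fusion(image_feat, keyword_feat, params)
```

Each gate value lies strictly between 0 and 1, so every fused dimension is a weighted average of the projected image and keyword features. In a trained model the gate parameters would be learned, so the weighting would depend on the input.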
