M3T: Multi-Modal Medical Transformer to bridge Clinical Context with Visual Insights for Retinal Image Medical Description Generation

Nagur Shareef Shaik; Teja Krishna Cherukuri; Dong Hye Ye

arXiv:2406.13129 · cs.CV · December 24, 2024

TL;DR

This paper introduces M3T, a multi-modal transformer that combines retinal image features with clinical keywords to generate medical descriptions, reporting a 13.5% improvement in BLEU@4 over the baseline on the DeepEyeNet dataset.

Contribution

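The paper presents a deep learning architecture that integrates visual representations with diagnostic keywords for retinal image description generation. It targets three limitations of earlier work: reliance on learned retinal image representations, difficulty handling multiple imaging modalities, and the lack of clinical context in visual representations.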

Findings

01. 13.5% improvement in BLEU@4 score over the baseline

02. Effective integration of visual and clinical modalities

03. Validated on the DeepEyeNet dataset against ophthalmologists' standards
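
For readers unfamiliar with the metric behind the headline number, BLEU@4 scores a generated description by its 1- to 4-gram overlap with a reference description. The sketch below is a minimal single-reference, sentence-level version with no smoothing, written for illustration only. It is not the evaluation script used in the paper, and the function names (`ngram_counts`, `bleu4`) and the two example sentences are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu4(candidate, reference):
    """Sentence-level BLEU@4 against one reference, without smoothing.

    Returns 0.0 as soon as any n-gram order (n = 1..4) has no match.
    """
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0

    log_precision = 0.0
    for n in range(1, 5):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0
        # Uniform weights of 1/4 over the four n-gram orders.
        log_precision += 0.25 * math.log(clipped / total)

    # Brevity penalty: penalize candidates shorter than the reference.
    if len(cand) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1.0 - len(ref) / len(cand))
    return brevity_penalty * math.exp(log_precision)


# Hypothetical sentences, not taken from DeepEyeNet.
reference = "fundus image shows macular edema with hard exudates"
candidate = "fundus image shows macular edema with exudates"
print(round(bleu4(candidate, reference), 4))
```

Published results are normally corpus-level and may use multiple references or smoothing, so scores from this sketch are not directly comparable to the figure reported above.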

Abstract

Automated retinal image medical description generation is crucial for streamlining medical diagnosis and treatment planning. Existing challenges include the reliance on learned retinal image representations, difficulties in handling multiple imaging modalities, and the lack of clinical context in visual representations. Addressing these issues, we propose the Multi-Modal Medical Transformer (M3T), a novel deep learning architecture that integrates visual representations with diagnostic keywords. Unlike previous studies focusing on specific aspects, our approach efficiently learns contextual information and semantics from both modalities, enabling the generation of precise and coherent medical descriptions for retinal images. Experimental studies on the DeepEyeNet dataset validate the success of M3T in meeting ophthalmologists' standards, demonstrating a substantial 13.5% improvement in…
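
One common way to let diagnostic keywords draw on visual representations is scaled dot-product cross-attention, the building block of the multi-head attention listed among this paper's methods. The sketch below is a single-head, dependency-free illustration in which keyword embeddings act as queries over image-region features. It is an assumption about how such a fusion could look, not the M3T architecture itself; the names `cross_attention`, `keyword_embeddings`, and `image_features` and all numeric values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention.

    queries: one vector per keyword
    keys, values: one vector per image region
    Returns one fused vector per query, each a weighted mix of the values.
    """
    scale = math.sqrt(len(keys[0]))
    fused = []
    for q in queries:
        # Similarity between this keyword and every image region.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Weighted sum of region value vectors.
        dim = len(values[0])
        fused.append(
            [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
        )
    return fused


# Toy inputs: 2 keyword embeddings attending over 3 image-region features.
keyword_embeddings = [[1.0, 0.0], [0.0, 1.0]]
image_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = cross_attention(keyword_embeddings, image_features, image_features)
print(len(fused), len(fused[0]))
```

A full transformer layer would add learned linear projections for queries, keys, and values, multiple heads, residual connections, and layer normalization, all of which this sketch omits.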

Taxonomy

Topics: Retinal Imaging and Analysis · AI in cancer detection · Artificial Intelligence in Healthcare

Methods: Linear Layer · Multi-Head Attention · Residual Connection · Softmax · Layer Normalization · Byte Pair Encoding · Label Smoothing · Position-Wise Feed-Forward Layer · Dropout · Adam