Longer Version for "Deep Context-Encoding Network for Retinal Image Captioning"
Jia-Hong Huang, Ting-Wei Wu, Chao-Han Huck Yang, Marcel Worring

TL;DR
This paper introduces a novel context-driven encoding network that effectively integrates image and keyword information to generate accurate and meaningful medical reports for retinal images, outperforming existing models.
Contribution
A new multi-modal encoder-decoder model that leverages interactive image and keyword information for improved retinal image report generation.
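As a rough illustration of what combining image and keyword information in an encoder can look like, here is a minimal sketch. It is not the authors' architecture: the fusion rule (mean-pooling the keyword embeddings and concatenating the result with the image feature), the function names `mean_pool` and `fuse`, and all numeric values are assumptions made only for this example.

```python
from typing import List


def mean_pool(vectors: List[List[float]]) -> List[float]:
    """Average a non-empty list of equal-length vectors element-wise."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def fuse(image_feature: List[float],
         keyword_embeddings: List[List[float]]) -> List[float]:
    """Concatenate the image feature with the mean-pooled keyword embedding.

    Assumption: simple concatenation stands in for whatever fusion the
    paper's multi-modal encoder actually performs.
    """
    return image_feature + mean_pool(keyword_embeddings)


# Toy values, invented for illustration only.
image_feature = [0.2, 0.8, 0.5]                 # stand-in for image encoder output
keyword_embeddings = [[1.0, 0.0], [0.0, 1.0]]   # stand-ins for two keyword vectors

fused = fuse(image_feature, keyword_embeddings)
print(fused)
```

In a full model, a vector like `fused` would be handed to a decoder that generates the report token by token; that part is omitted here.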
Findings
Achieves state-of-the-art performance on medical report metrics
Improves BLEU-avg by 16%, CIDEr by 10.2%, ROUGE by 8.6%
Effectively leverages image and keyword interaction
Abstract
Automatically generating medical reports for retinal images is one of the promising ways to help ophthalmologists reduce their workload and improve work efficiency. In this work, we propose a new context-driven encoding network to automatically generate medical reports for retinal images. The proposed model is mainly composed of a multi-modal input encoder and a fused-feature decoder. Our experimental results show that our proposed method is capable of effectively leveraging the interactive information between the input image and context, i.e., keywords in our case. The proposed method creates more accurate and meaningful reports for retinal images than baseline models and achieves state-of-the-art performance. This performance is shown in several commonly used metrics for the medical report generation task: BLEU-avg (+16%), CIDEr (+10.2%), and ROUGE (+8.6%).
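To make the BLEU-avg metric more concrete, the sketch below computes sentence-level BLEU-1 through BLEU-4 against a single reference and averages them. The page does not define BLEU-avg, so treating it as the mean of BLEU-1 to BLEU-4 is an assumption, as are the function names and the single-reference, unsmoothed setup. Published numbers typically come from a standard evaluation toolkit, so values from this sketch are not directly comparable.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate: List[str], reference: List[str], n: int) -> float:
    """Clipped n-gram precision of the candidate against one reference."""
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def bleu_n(candidate: List[str], reference: List[str], max_n: int) -> float:
    """BLEU with uniform weights over 1..max_n grams, no smoothing."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_mean)


def bleu_avg(candidate: List[str], reference: List[str]) -> float:
    """Assumed definition: the mean of BLEU-1, BLEU-2, BLEU-3, and BLEU-4."""
    return sum(bleu_n(candidate, reference, n) for n in range(1, 5)) / 4


# Hypothetical report fragments, invented for illustration only.
reference = "fundus image shows macular edema".split()
candidate = "fundus image shows retinal edema".split()
score = bleu_avg(candidate, reference)
```

Because no smoothing is applied, any BLEU-n with a zero n-gram precision contributes 0 to the average, which penalizes short generated reports heavily.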
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
