RepsNet: Combining Vision with Language for Automated Medical Reports

Ajay Kumar Tanwani; Joelle Barral; Daniel Freedman

arXiv:2209.13171·cs.CV·September 28, 2022

TL;DR

RepsNet adapts pre-trained vision and language models to interpret medical images and automatically generate reports in natural language. Writing such reports by hand is error-prone for inexperienced practitioners and time-consuming for experienced ones.

Contribution

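It introduces an encoder-decoder framework. The encoder aligns images with natural language descriptions via contrastive learning. The decoder generates answers and reports by conditioning on the encoded image and on prior descriptions retrieved by nearest neighbor search.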
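The abstract states that the encoder aligns images and descriptions via contrastive learning. The sketch below shows one common form of that objective, a symmetric InfoNCE loss over a batch of paired image and text embeddings. It assumes a CLIP-style formulation with cosine similarity and a temperature of 0.07; the paper's exact loss, temperature and encoders are not given here, and `contrastive_loss` is a hypothetical name.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def log_softmax(logits: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def contrastive_loss(image_emb: np.ndarray, text_emb: np.ndarray,
                     temperature: float = 0.07) -> float:
    """Symmetric InfoNCE (assumed CLIP-style): row i of image_emb pairs with row i of text_emb."""
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature  # (batch, batch) scaled cosine similarities
    idx = np.arange(logits.shape[0])
    # Image-to-text: each image should score its own description highest (softmax over rows).
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-image: each description should score its own image highest (softmax over columns).
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return float((loss_i2t + loss_t2i) / 2)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = rng.normal(size=(8, 16))
    matched = images + 0.01 * rng.normal(size=(8, 16))  # descriptions close to their images
    print(contrastive_loss(images, matched))          # low: pairs are aligned
    print(contrastive_loss(images, images[::-1].copy()))  # high: pairs are mismatched
```

Minimizing this loss pulls each image toward its own description and pushes it away from the other descriptions in the batch. That shared embedding space is what makes nearest neighbor retrieval of descriptions possible later.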

Findings

01

Achieves 81.08% accuracy on VQA-Rad 2018

02

Attains 0.58 BLEU-1 score on IU-Xray

03

Outperforms existing state-of-the-art methods
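BLEU-1 measures unigram overlap between a generated report and the reference report. The sketch below computes sentence-level BLEU-1 as clipped unigram precision multiplied by a brevity penalty. It assumes whitespace tokenization and uses the closest reference length for the penalty. Published scores such as the 0.58 on IU-Xray are usually corpus-level, which pools counts over the whole test set, and the paper's exact tokenization and aggregation are not stated here. `bleu1` is a hypothetical helper, not a library API.

```python
import math
from collections import Counter


def bleu1(candidate: list[str], references: list[list[str]]) -> float:
    """Sentence-level BLEU-1: clipped unigram precision times brevity penalty."""
    if not candidate:
        return 0.0
    # Highest count of each token in any single reference (used for clipping).
    max_ref: Counter = Counter()
    for ref in references:
        for tok, count in Counter(ref).items():
            max_ref[tok] = max(max_ref[tok], count)
    clipped = sum(min(count, max_ref[tok]) for tok, count in Counter(candidate).items())
    precision = clipped / len(candidate)
    # Brevity penalty against the reference length closest to the candidate length.
    c = len(candidate)
    r = min((len(ref) for ref in references), key=lambda n: (abs(n - c), n))
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return precision * bp


if __name__ == "__main__":
    generated = "the heart size is normal".split()
    reference = "heart size is normal".split()
    print(bleu1(generated, [reference]))  # 4 of 5 unigrams match; no length penalty
```

Clipping stops a report that repeats a correct word from inflating its score. The brevity penalty stops a very short report from scoring well on precision alone.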

Abstract

Writing reports by analyzing medical images is error-prone for inexperienced practitioners and time-consuming for experienced ones. In this work, we present RepsNet that adapts pre-trained vision and language models to interpret medical images and generate automated reports in natural language. RepsNet consists of an encoder-decoder model: the encoder aligns the images with natural language descriptions via contrastive learning, while the decoder predicts answers by conditioning on encoded images and prior context of descriptions retrieved by nearest neighbor search. We formulate the problem in a visual question answering setting to handle both categorical and descriptive natural language answers. We perform experiments on two challenging tasks of medical visual question answering (VQA-Rad) and report generation (IU-Xray) on radiology image datasets. Results show that RepsNet…
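The decoder conditions on "prior context of descriptions retrieved by nearest neighbor search." A minimal sketch of that retrieval step, assuming image and description embeddings already share one space (for example after contrastive alignment), ranks stored descriptions by cosine similarity to the query image and keeps the top k. The prompt layout in `build_decoder_prompt` is purely hypothetical. It also leaves out the image features that the real decoder receives alongside this text, because the abstract does not say how the two are combined.

```python
import numpy as np


def retrieve_context(query_emb: np.ndarray, bank_emb: np.ndarray,
                     bank_texts: list[str], k: int = 2) -> list[str]:
    """Return the k stored descriptions whose embeddings are most cosine-similar to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    bank = bank_emb / np.linalg.norm(bank_emb, axis=1, keepdims=True)
    scores = bank @ q                # cosine similarity of each description to the image
    top = np.argsort(-scores)[:k]    # indices of the k highest scores, best first
    return [bank_texts[i] for i in top]


def build_decoder_prompt(question: str, context: list[str]) -> str:
    """Hypothetical text layout: retrieved descriptions, then the question."""
    return " ".join(context) + " question: " + question + " answer:"


if __name__ == "__main__":
    bank_emb = np.eye(3)  # toy embeddings for three stored descriptions
    bank_texts = ["no acute findings", "cardiomegaly", "pleural effusion"]
    image_emb = np.array([0.1, 0.9, 0.2])
    context = retrieve_context(image_emb, bank_emb, bank_texts, k=2)
    print(build_decoder_prompt("is the heart enlarged?", context))
```

Framing both tasks as question answering lets the same decoder handle a categorical question (a short answer) and report generation (a descriptive answer). Only the question text and the expected answer length differ.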
