Surgical Instruction Generation with Transformers

Jinglu Zhang, Yinyu Nie, Jian Chang, and Jian Jun Zhang

arXiv:2107.06964 · cs.CV · July 20, 2021

TL;DR

This paper presents a transformer-based encoder-decoder model, trained with self-critical reinforcement learning, that generates surgical instructions from images and outperforms the existing baseline on the DAISI dataset across all caption evaluation metrics.

Contribution

It introduces a transformer-backboned encoder-decoder network trained with self-critical reinforcement learning for surgical instruction generation, which requires jointly understanding the surgical activity in view and relating visual information to textual description.

Findings

01

Outperforms baseline methods on all caption metrics

02

Effective handling of multimodal surgical data

03

Demonstrates benefits of transformer architecture in this domain

Abstract

Automatic surgical instruction generation is a prerequisite for intra-operative context-aware surgical assistance. However, generating instructions from surgical scenes is challenging, as it requires jointly understanding the surgical activity in the current view and modelling the relationships between visual information and textual description. Inspired by neural machine translation and image captioning tasks in the open domain, we introduce a transformer-backboned encoder-decoder network with self-critical reinforcement learning to generate instructions from surgical images. We evaluate the effectiveness of our method on the DAISI dataset, which includes 290 procedures from various medical disciplines. Our approach outperforms the existing baseline over all caption evaluation metrics. The results demonstrate the benefits of the encoder-decoder structure backboned by transformer in handling…
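The self-critical reinforcement learning step named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the common formulation in which a sampled caption is rewarded relative to the greedy-decoded caption from the same model. The function names `scst_loss` and `unigram_overlap` are hypothetical, and the overlap score is only a stand-in for whichever caption metric is used as the reward.

```python
import math


def unigram_overlap(candidate, reference):
    """Toy reward (an assumption, not the paper's metric): the fraction
    of candidate tokens that also appear in the reference."""
    cand = candidate.split()
    ref = set(reference.split())
    if not cand:
        return 0.0
    return sum(tok in ref for tok in cand) / len(cand)


def scst_loss(sample_logprobs, sample_reward, greedy_reward):
    """Self-critical policy-gradient loss for one sampled caption.

    sample_logprobs: per-token log-probabilities of the sampled caption.
    sample_reward:   score of the sampled caption.
    greedy_reward:   score of the greedy-decoded caption (the baseline).
    """
    # Samples that beat the greedy baseline get a positive advantage,
    # so minimising the loss raises their log-probability.
    advantage = sample_reward - greedy_reward
    return -advantage * sum(sample_logprobs)


if __name__ == "__main__":
    reference = "cut the vessel"
    sampled = "cut the vessel"   # hypothetical sampled caption
    greedy = "cut the tissue"    # hypothetical greedy caption
    logprobs = [math.log(0.5), math.log(0.25), math.log(0.5)]
    loss = scst_loss(
        logprobs,
        unigram_overlap(sampled, reference),
        unigram_overlap(greedy, reference),
    )
    print(f"loss = {loss:.4f}")
```

In a real training loop the log-probabilities would come from the decoder and the loss would be backpropagated through them. The greedy baseline needs no separate value network, which is the usual motivation for this formulation.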

Code & Models

Repositories

xumengyaamy/swinmlp_trancap
pytorch

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques