Surgical Instruction Generation with Transformers
Jinglu Zhang, Yinyu Nie, Jian Chang, and Jian Jun Zhang

TL;DR
This paper presents a transformer-based encoder-decoder model with reinforcement learning for automatic surgical instruction generation from images, outperforming existing methods on the DAISI dataset.
Contribution
It introduces a novel transformer-backed neural network architecture with reinforcement learning for surgical instruction generation, advancing multimodal understanding in surgical scenes.
Findings
Outperforms the existing baseline on all caption evaluation metrics on the DAISI dataset
Effective handling of multimodal surgical data
Demonstrates benefits of transformer architecture in this domain
Abstract
Automatic surgical instruction generation is a prerequisite for intra-operative context-aware surgical assistance. However, generating instructions from surgical scenes is challenging, as it requires jointly understanding the surgical activity in the current view and modelling the relationships between visual information and textual description. Inspired by neural machine translation and image captioning tasks in the open domain, we introduce a transformer-backboned encoder-decoder network with self-critical reinforcement learning to generate instructions from surgical images. We evaluate the effectiveness of our method on the DAISI dataset, which includes 290 procedures from various medical disciplines. Our approach outperforms the existing baseline on all caption evaluation metrics. The results demonstrate the benefits of the encoder-decoder structure backboned by transformer in handling…
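The self-critical reinforcement learning mentioned in the abstract can be illustrated with a minimal sketch. In self-critical training, the reward of a sampled caption is compared against the reward of the model's own greedy caption, and the difference scales the log-likelihood of the sample. The abstract does not say which reward the authors use, so the unigram-overlap reward `overlap_f1`, the function `self_critical_loss`, and the example tokens below are all hypothetical stand-ins, not the paper's implementation.

```python
from collections import Counter


def overlap_f1(candidate, reference):
    """Toy reward (assumption): unigram F1 between two token lists.

    A stand-in for a real caption metric; the paper's reward is not
    specified in the abstract.
    """
    if not candidate or not reference:
        return 0.0
    common = sum((Counter(candidate) & Counter(reference)).values())
    if common == 0:
        return 0.0
    precision = common / len(candidate)
    recall = common / len(reference)
    return 2 * precision * recall / (precision + recall)


def self_critical_loss(sampled_tokens, sampled_log_probs, greedy_tokens,
                       reference, reward_fn=overlap_f1):
    """Self-critical loss for one example.

    advantage = reward(sampled) - reward(greedy baseline)
    loss      = -advantage * sum of log-probabilities of the sampled tokens
    Minimising this raises the likelihood of samples that beat the
    greedy caption and lowers it for samples that do worse.
    """
    advantage = reward_fn(sampled_tokens, reference) - reward_fn(greedy_tokens, reference)
    return -advantage * sum(sampled_log_probs)


# Hypothetical example: the sampled instruction matches the reference,
# while the greedy instruction drops one word.
reference = ["insert", "the", "needle"]
sampled = ["insert", "the", "needle"]
greedy = ["insert", "needle"]
loss = self_critical_loss(sampled, [-0.5, -0.5, -1.0], greedy, reference)
```

In a real training loop the log-probabilities would come from the decoder and the loss would be averaged over a batch; here they are plain floats so the arithmetic is easy to follow.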
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques
