GP-VLS: A general-purpose vision language model for surgery
Samuel Schmidgall, Joseph Cho, Cyril Zakka, William Hiesinger

TL;DR
GP-VLS is a versatile vision-language model for surgery that integrates medical knowledge and visual understanding, enabling broad surgical AI applications and outperforming existing models on multiple benchmarks.
Contribution
Introduces GP-VLS, a general-purpose surgical vision-language model trained on six new datasets, together with SurgiQual, a new benchmark for evaluating general-purpose surgical models.
Findings
GP-VLS outperforms existing models by 8-21% on surgical benchmarks.
GP-VLS demonstrates strong performance on medical and surgical knowledge tests.
The model is open-source, facilitating further research and development.
Abstract
Surgery requires comprehensive medical knowledge, visual assessment skills, and procedural expertise. While recent surgical AI models have focused on solving task-specific problems, there is a need for general-purpose systems that can understand surgical scenes and interact through natural language. This paper introduces GP-VLS, a general-purpose vision language model for surgery that integrates medical and surgical knowledge with visual scene understanding. For comprehensively evaluating general-purpose surgical models, we propose SurgiQual, which evaluates across medical and surgical knowledge benchmarks as well as surgical vision-language questions. To train GP-VLS, we develop six new datasets spanning medical knowledge, surgical textbooks, and vision-language pairs for tasks like phase recognition and tool identification. We show that GP-VLS significantly outperforms existing open-…
Taxonomy
Topics: Image Retrieval and Classification Techniques
