SurgVisAgent: Multimodal Agentic Model for Versatile Surgical Visual Enhancement
Zeyu Lei, Hongyuan Yu, Jinlin Wu, Zhen Chen

TL;DR
SurgVisAgent is a multimodal agentic model that identifies the category and severity of distortion in surgical images and enhances them accordingly, supporting surgeons' decision-making in complex real-world scenarios.
Contribution
The paper introduces a unified, end-to-end surgical vision agent that uses multimodal large language models (MLLMs) for multi-task image enhancement.
Findings
Outperforms traditional single-task models in diverse enhancement tasks
Effectively identifies distortion categories and severity levels in surgical images
Demonstrates potential as a comprehensive surgical assistance tool
Abstract
Precise surgical interventions are vital to patient safety, and advanced enhancement algorithms have been developed to assist surgeons in decision-making. Despite significant progress, these algorithms are typically designed for single tasks in specific scenarios, limiting their effectiveness in complex real-world situations. To address this limitation, we propose SurgVisAgent, an end-to-end intelligent surgical vision agent built on multimodal large language models (MLLMs). SurgVisAgent dynamically identifies distortion categories and severity levels in endoscopic images, enabling it to perform a variety of enhancement tasks such as low-light enhancement, overexposure correction, motion blur elimination, and smoke removal. Specifically, to achieve superior surgical scenario understanding, we design a prior model that provides domain-specific knowledge. Additionally, through in-context…
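The identify-then-enhance flow described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the MLLM is replaced by a hypothetical brightness heuristic (`diagnose`), images are simplified to grayscale grids with values in [0, 1], the enhancement is plain gamma correction, and all thresholds and gamma values are made up for illustration. Only the low-light and overexposure cases are covered; motion blur and smoke removal are omitted.

```python
from typing import List, Tuple

# Assumption: a grayscale image as rows of floats in [0, 1].
Image = List[List[float]]


def mean_brightness(img: Image) -> float:
    """Average pixel value of the image."""
    pixels = [p for row in img for p in row]
    return sum(pixels) / len(pixels)


def diagnose(img: Image) -> Tuple[str, str]:
    """Hypothetical stand-in for the MLLM: returns (category, severity).

    The thresholds are illustrative, not taken from the paper.
    """
    m = mean_brightness(img)
    if m < 0.2:
        return "low_light", "severe"
    if m < 0.4:
        return "low_light", "mild"
    if m > 0.8:
        return "overexposure", "severe"
    if m > 0.6:
        return "overexposure", "mild"
    return "none", "none"


def apply_gamma(img: Image, gamma: float) -> Image:
    """Gamma correction: gamma < 1 brightens, gamma > 1 darkens."""
    return [[p ** gamma for p in row] for row in img]


# Illustrative mapping from (category, severity) to a gamma value.
GAMMA = {
    ("low_light", "mild"): 0.7,
    ("low_light", "severe"): 0.4,
    ("overexposure", "mild"): 1.5,
    ("overexposure", "severe"): 2.5,
}


def enhance(img: Image) -> Tuple[Image, Tuple[str, str]]:
    """Diagnose the distortion, then apply the matching enhancement."""
    label = diagnose(img)
    gamma = GAMMA.get(label)
    if gamma is None:
        # No recognised distortion: return the image unchanged.
        return img, label
    return apply_gamma(img, gamma), label
```

The point of the sketch is the routing step: the enhancement applied depends on both the diagnosed category and its severity, so one entry point can serve several tasks instead of one model per task.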
