CLIP-RL: Surgical Scene Segmentation Using Contrastive Language-Vision Pretraining & Reinforcement Learning

Fatmaelzahraa Ali Ahmed; Muhammad Arsalan; Abdulaziz Al-Ali; Khalid Al-Jalham; Shidin Balakrishnan

arXiv:2507.04317 · eess.IV · July 8, 2025

TL;DR

CLIP-RL is a novel surgical scene segmentation model that combines contrastive language-vision pretraining with reinforcement learning and curriculum learning, achieving state-of-the-art results on EndoVis datasets.

Contribution

Introduces CLIP-RL, a new segmentation approach integrating contrastive pretraining with reinforcement and curriculum learning for improved surgical scene analysis.

Findings

01. Achieved 81% mean IoU on the EndoVis 2018 dataset.

02. Outperformed existing models with 74.12% mean IoU on the EndoVis 2017 dataset.

03. Demonstrated robustness under occlusions, texture variations, and dynamic lighting.
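
The findings above are reported as mean IoU (intersection over union). As a rough illustration of that metric, the following is a minimal sketch that averages per-class IoU over one pair of label maps. The function name `mean_iou`, the nested-list mask layout, and the choice to skip classes absent from both masks are assumptions made for illustration; this summary does not state the paper's exact evaluation protocol (for example, per-frame versus dataset-level averaging), so the numbers above may be computed differently.

```python
from typing import List


def mean_iou(pred: List[List[int]], target: List[List[int]], num_classes: int) -> float:
    """Average per-class IoU for two label maps of identical shape.

    Hypothetical helper: classes that appear in neither mask are skipped,
    which is one common convention, not necessarily the paper's.
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p == t:
                # Pixel agrees: it belongs to both the intersection and the union.
                intersection[p] += 1
                union[p] += 1
            else:
                # Pixel disagrees: it enlarges the union of both classes involved.
                union[p] += 1
                union[t] += 1
    ious = [intersection[c] / union[c] for c in range(num_classes) if union[c] > 0]
    return sum(ious) / len(ious) if ious else float("nan")


# Toy 2x2 example: class 0 = background, class 1 = instrument.
pred = [[0, 1], [1, 1]]
target = [[0, 1], [0, 1]]
score = mean_iou(pred, target, num_classes=2)
print(f"mean IoU = {score:.4f}")
```

In the toy example, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12. A dataset-level score would typically accumulate the intersection and union counts over many frames before dividing, or average per-frame scores, depending on the benchmark's rules.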

Abstract

Understanding surgical scenes can improve the quality of patient care, especially given the vast amount of video data generated during minimally invasive surgery (MIS). Processing these videos yields valuable assets for training sophisticated models. In this paper, we introduce CLIP-RL, a novel contrastive language-image pre-training model tailored for semantic segmentation of surgical scenes. CLIP-RL presents a new segmentation approach that combines reinforcement learning and curriculum learning, enabling continuous refinement of the segmentation masks throughout the full training pipeline. Our model has shown robust performance under challenging optical conditions such as occlusions, texture variations, and dynamic lighting. The CLIP model serves as a powerful feature extractor, capturing rich semantic context that enhances the distinction between instruments and…
