CLIP-RL: Surgical Scene Segmentation Using Contrastive Language-Vision Pretraining & Reinforcement Learning
Fatmaelzahraa Ali Ahmed, Muhammad Arsalan, Abdulaziz Al-Ali, Khalid Al-Jalham, Shidin Balakrishnan

TL;DR
CLIP-RL is a novel surgical scene segmentation model that combines contrastive language-vision pretraining with reinforcement learning and curriculum learning, achieving state-of-the-art results on EndoVis datasets.
Contribution
Introduces CLIP-RL, a new segmentation approach integrating contrastive pretraining with reinforcement and curriculum learning for improved surgical scene analysis.
Findings
Achieved 81% mean IoU on the EndoVis 2018 dataset.
Outperformed existing models on the EndoVis 2017 dataset with a 74.12% mean IoU.
Demonstrated robustness under occlusions, texture variations, and dynamic lighting.
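For readers unfamiliar with the metric behind these numbers, the following is a minimal sketch of how mean IoU (intersection over union, averaged over classes) is commonly computed for a segmentation mask. It is not the authors' evaluation code: the function name `mean_iou`, the toy masks, and the choice to skip classes absent from both prediction and ground truth are assumptions made for illustration. The exact averaging protocol used for the EndoVis results is not stated in this summary.

```python
from typing import List


def mean_iou(pred: List[List[int]], target: List[List[int]], num_classes: int) -> float:
    """Average per-class IoU over classes present in pred or target.

    pred and target are 2D grids of integer class labels with the same shape.
    Classes that appear in neither grid are skipped (an assumption; some
    benchmarks count them differently).
    """
    ious = []
    for c in range(num_classes):
        intersection = 0
        union = 0
        for pred_row, target_row in zip(pred, target):
            for p, t in zip(pred_row, target_row):
                in_pred = p == c
                in_target = t == c
                if in_pred and in_target:
                    intersection += 1
                if in_pred or in_target:
                    union += 1
        if union > 0:
            ious.append(intersection / union)
    # No class present at all: the metric is undefined.
    return sum(ious) / len(ious) if ious else float("nan")


# Hypothetical 2x3 masks with three classes (0 = background, 1 and 2 = instruments).
toy_pred = [[0, 0, 1],
            [1, 1, 2]]
toy_target = [[0, 1, 1],
              [1, 1, 2]]

# Per-class IoU is 1/2, 3/4, and 1/1, so the mean is 0.75.
toy_score = mean_iou(toy_pred, toy_target, num_classes=3)
```

On the toy masks, one mislabeled pixel lowers the IoU of both classes it touches, which is why mean IoU penalizes boundary errors on small instruments more heavily than pixel accuracy does.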
Abstract
Understanding surgical scenes can improve the quality of patient care, especially given the vast amount of video data generated during minimally invasive surgery (MIS). Processing these videos yields valuable assets for training sophisticated models. In this paper, we introduce CLIP-RL, a novel contrastive language-image pre-training model tailored for semantic segmentation of surgical scenes. CLIP-RL presents a new segmentation approach that combines reinforcement learning and curriculum learning, enabling continuous refinement of the segmentation masks throughout the full training pipeline. Our model has shown robust performance under challenging optical conditions such as occlusions, texture variations, and dynamic lighting. The CLIP model serves as a powerful feature extractor, capturing rich semantic context that enhances the distinction between instruments and…
