AquaticCLIP: A Vision-Language Foundation Model for Underwater Scene Analysis
Basit Alawode, Iyyakutti Iyappan Ganapathi, Sajid Javed, Naoufel Werghi, Mohammed Bennamoun, and Arif Mahmood

TL;DR
AquaticCLIP is a novel contrastive vision-language model designed for underwater scene understanding, enabling various tasks without ground-truth annotations by leveraging a large-scale underwater image-text dataset.
Contribution
The paper introduces AquaticCLIP, a new unsupervised framework with a prompt-guided vision encoder for underwater scene analysis, setting a new benchmark in aquatic vision-language tasks.
Findings
Outperforms existing methods in zero-shot underwater vision tasks
Enables segmentation, classification, detection, and object counting without annotations
Constructs a dataset of 2 million underwater image-text pairs from heterogeneous sources
Abstract
The preservation of aquatic biodiversity is critical in mitigating the effects of climate change. Aquatic scene understanding plays a pivotal role in aiding marine scientists in their decision-making processes. In this paper, we introduce AquaticCLIP, a novel contrastive language-image pre-training model tailored for aquatic scene understanding. AquaticCLIP presents a new unsupervised learning framework that aligns images and texts in aquatic environments, enabling tasks such as segmentation, classification, detection, and object counting. By leveraging our large-scale underwater image-text paired dataset without the need for ground-truth annotations, our model enriches existing vision-language models in the aquatic domain. For this purpose, we construct a 2 million underwater image-text paired dataset using heterogeneous resources, including YouTube, Netflix, NatGeo, etc. To fine-tune…
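The abstract describes contrastive alignment of images and texts, and the findings mention zero-shot use. A minimal sketch of that general idea follows, assuming a generic CLIP-style symmetric contrastive loss; it is not AquaticCLIP's actual objective or architecture, which this excerpt does not specify. The function names, the toy 2-D embeddings, and the temperature value of 0.07 are all illustrative assumptions.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss for a batch of paired embeddings.

    Pair k (image k, text k) is the positive; every other combination
    in the batch is treated as a negative. The temperature is an
    assumed value, not one taken from the paper.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = scaled cosine similarity between image i and text j
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    # Image-to-text direction: each row should peak on the diagonal.
    loss_i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image direction: each column should peak on the diagonal.
    cols = [[logits[r][c] for r in range(n)] for c in range(n)]
    loss_t2i = -sum(log_softmax(cols[k])[k] for k in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)

def zero_shot_classify(image_emb, class_text_embs):
    """Return the index of the class prompt most similar to the image."""
    img = l2_normalize(image_emb)
    scores = [sum(a * b for a, b in zip(img, l2_normalize(t)))
              for t in class_text_embs]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy embeddings (hypothetical): two images and their matching captions.
images = [[1.0, 0.0], [0.0, 1.0]]
texts = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(images, texts)
shuffled = contrastive_loss(images, texts[::-1])
```

With correctly paired embeddings the loss is close to zero, while swapping the captions makes it large. In a real system the embeddings would come from trained image and text encoders, and `zero_shot_classify` would compare an image embedding against embeddings of class-name prompts.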
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Underwater Vehicles and Communication Systems · Robotics and Sensor-Based Localization
Methods: ALIGN
