AquaticCLIP: A Vision-Language Foundation Model for Underwater Scene Analysis

Basit Alawode, Iyyakutti Iyappan Ganapathi, Sajid Javed, Naoufel Werghi, Mohammed Bennamoun, and Arif Mahmood

arXiv:2502.01785 · cs.CV · February 5, 2025

TL;DR

AquaticCLIP is a contrastive vision-language model for underwater scene understanding. Trained on a large-scale underwater image-text dataset without ground-truth annotations, it supports segmentation, classification, detection, and object counting.

Contribution

The paper introduces AquaticCLIP, a new unsupervised framework with a prompt-guided vision encoder for underwater scene analysis, setting a new benchmark in aquatic vision-language tasks.

Findings

01

Outperforms existing methods in zero-shot underwater vision tasks

02

Enables segmentation, classification, detection, and object counting without annotations

03

Constructs a dataset of 2 million underwater image-text pairs from diverse sources

Abstract

The preservation of aquatic biodiversity is critical in mitigating the effects of climate change. Aquatic scene understanding plays a pivotal role in aiding marine scientists in their decision-making processes. In this paper, we introduce AquaticCLIP, a novel contrastive language-image pre-training model tailored for aquatic scene understanding. AquaticCLIP presents a new unsupervised learning framework that aligns images and texts in aquatic environments, enabling tasks such as segmentation, classification, detection, and object counting. By leveraging our large-scale underwater image-text paired dataset without the need for ground-truth annotations, our model enriches existing vision-language models in the aquatic domain. For this purpose, we construct a 2 million underwater image-text paired dataset using heterogeneous resources, including YouTube, Netflix, NatGeo, etc. To fine-tune…
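
The abstract describes aligning images and texts through contrastive language-image pre-training. As a rough illustration of that general idea only, the sketch below computes a generic CLIP-style symmetric contrastive loss over a batch of paired embeddings. It is not the paper's exact objective, and it omits components such as the prompt-guided vision encoder. The function names and the temperature value of 0.07 are assumptions made for this example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Generic symmetric contrastive loss (hypothetical helper, not the paper's code).

    image_embs[i] and text_embs[i] are assumed to be a matching pair;
    every other combination in the batch is treated as a negative.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)

    # logits[i][j]: temperature-scaled cosine similarity of image i and text j.
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    # Image-to-text direction: image i should select text i.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-image direction: text j should select image j.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n

    return 0.5 * (i2t + t2i)


# Toy 2-D embeddings: correctly paired versus deliberately swapped captions.
aligned = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy case the loss is close to zero when each image embedding matches its own caption and large when the captions are swapped, which is the signal that pulls paired images and texts together during pre-training.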

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Underwater Vehicles and Communication Systems · Robotics and Sensor-Based Localization

Methods: ALIGN