MSC: A Marine Wildlife Video Dataset with Grounded Segmentation and Clip-Level Captioning

Quang-Trung Truong; Yuk-Kwan Wong; Vo Hoang Kim Tuyen Dang; Rinaldi Gotama; Duc Thanh Nguyen; Sai-Kit Yeung

arXiv:2508.04549 · cs.CV · September 3, 2025

TL;DR

This paper introduces MSC, a comprehensive marine wildlife video dataset with grounded segmentation and clip-level captioning, addressing the unique challenges of underwater video understanding and enabling improved marine scene analysis.

Contribution

We propose a novel two-stage marine object-oriented video captioning pipeline and a new benchmark dataset that incorporates segmentation masks for enhanced visual grounding and captioning.

Findings

01 Enhanced marine video understanding through segmentation and captioning

02 Effective detection of salient object transitions in marine scenes

03 Improved semantic richness in marine video captions

Abstract

Marine videos present significant challenges for video understanding due to the dynamics of marine objects and the surrounding environment, camera motion, and the complexity of underwater scenes. Existing video captioning datasets, which typically focus on generic or human-centric domains, often fail to generalize to the complexities of the marine environment and to yield insights about marine life. To address these limitations, we propose a two-stage marine object-oriented video captioning pipeline. We introduce a comprehensive video understanding benchmark that leverages triplets of video, text, and segmentation masks to facilitate visual grounding and captioning, leading to improved marine video understanding and analysis, as well as marine video generation. Additionally, we highlight the effectiveness of video splitting in order to detect salient object transitions in scene changes, which…
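
To make the "triplets of video, text, and segmentation masks" concrete, the following is a minimal sketch of how one clip-level sample might be represented in Python. It is an illustration only, not the dataset's actual schema or the authors' code: the class name `ClipSample`, its fields, and the `bounding_boxes` helper are all hypothetical, and masks are assumed to be plain nested lists of booleans so the example needs no third-party libraries.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical representation: one binary mask per frame, stored as H x W booleans.
Mask = List[List[bool]]
Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), inclusive


@dataclass
class ClipSample:
    """One (video, text, mask) triplet; field names are assumptions."""
    frame_paths: List[str]   # ordered frames of the clip
    masks: List[Mask]        # segmentation mask of the salient object, per frame
    caption: str             # clip-level caption

    def __post_init__(self) -> None:
        # Grounding only makes sense if every frame has a matching mask.
        if len(self.frame_paths) != len(self.masks):
            raise ValueError("expected exactly one mask per frame")

    def bounding_boxes(self) -> List[Optional[Box]]:
        """Return the tight box around each mask, or None for an empty mask."""
        boxes: List[Optional[Box]] = []
        for mask in self.masks:
            ys = [y for y, row in enumerate(mask) if any(row)]
            xs = [x for row in mask for x, on in enumerate(row) if on]
            boxes.append((min(xs), min(ys), max(xs), max(ys)) if ys else None)
        return boxes


sample = ClipSample(
    frame_paths=["frame_000.png", "frame_001.png"],  # placeholder file names
    masks=[
        [[False, True], [False, True]],    # object occupies the right column
        [[False, False], [False, False]],  # object not visible in this frame
    ],
    caption="A fish swims along the right edge of the frame.",  # made-up caption
)
boxes = sample.bounding_boxes()
```

Keeping the mask alongside the caption in a single record is what lets a caption be tied to a specific region of each frame; a frame whose mask is empty yields `None`, which is one simple way to flag that the described object has left the view.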
