TL;DR
OceanPile is a large-scale multimodal ocean corpus for training ocean foundation models. It targets two obstacles to marine AI: ocean data are fragmented across disparate sources, and they lack unified schemas and semantic alignment.
Contribution
The paper presents OceanPile, a large-scale, multimodal ocean dataset with a novel pipeline for data synthesis, quality control, and evaluation, tailored for marine artificial intelligence.
Findings
Models trained on OceanPile show significant performance improvements.
The dataset enables better semantic alignment across ocean data modalities.
OceanPile is publicly available to foster marine AI research.
Abstract
The vast and underexplored ocean plays a critical role in regulating global climate and supporting marine biodiversity, yet artificial intelligence has so far delivered limited impact in this domain due to a fundamental data bottleneck. Specifically, ocean data are highly fragmented across disparate sources and inherently exhibit multi-modal, high-noise, and weakly labeled characteristics, lacking unified schemas and semantic alignment. Although Multimodal Large Language Models (MLLMs) have achieved remarkable success in general domains, their application to ocean science remains severely constrained by the absence of large-scale, well-aligned multimodal datasets tailored to marine environments. To bridge this gap, we introduce OceanPile, a large-scale multimodal corpus designed for ocean foundation models. It comprises three key components: OceanCorpus, a unified collection integrating…
