OceanPile: A Large-Scale Multimodal Ocean Corpus for Foundation Models

Yida Xue; Ningyu Zhang; Tingwei Wu; Zhe Ma; Daxiong Ji; Zhao Wang; Guozhou Zheng; Huajun Chen

arXiv:2605.00877 · cs.MM · May 7, 2026

TL;DR

OceanPile is a large-scale multimodal ocean corpus for training foundation models in marine AI. It addresses the fragmentation of ocean data across disparate sources and the lack of semantic alignment between modalities.

Contribution

The paper presents OceanPile, a large-scale, multimodal ocean dataset with a novel pipeline for data synthesis, quality control, and evaluation, tailored for marine artificial intelligence.

Findings

1. Models trained on OceanPile show significant performance improvements.

2. The dataset enables better semantic alignment across ocean data modalities.

3. OceanPile is publicly available to foster marine AI research.

Abstract

The vast and underexplored ocean plays a critical role in regulating global climate and supporting marine biodiversity, yet artificial intelligence has so far delivered limited impact in this domain due to a fundamental data bottleneck. Specifically, ocean data are highly fragmented across disparate sources and inherently exhibit multi-modal, high-noise, and weakly labeled characteristics, lacking unified schemas and semantic alignment. Although Multimodal Large Language Models (MLLMs) have achieved remarkable success in general domains, their application to ocean science remains severely constrained by the absence of large-scale, well-aligned multimodal datasets tailored to marine environments. To bridge this gap, we introduce OceanPile, a large-scale multimodal corpus designed for ocean foundation models. It comprises three key components: OceanCorpus, a unified collection integrating…

Code & Models

Repositories

oceangpt/OceanPile (GitHub)
