MarineEval: Assessing the Marine Intelligence of Vision-Language Models
Yuk-Kwan Wong, Tuan-An To, Jipeng Zhang, Ziqiang Zheng, Sai-Kit Yeung

TL;DR
MarineEval introduces a large-scale marine-specific VLM dataset and benchmark, revealing current models' limitations in domain-specific marine question answering and highlighting the need for further advancements.
Contribution
This work creates the first comprehensive marine VLM dataset and benchmark, enabling evaluation of VLMs in marine domain expertise and identifying their current shortcomings.
Findings
Existing VLMs perform poorly on marine domain questions.
The dataset covers diverse tasks and capacities, ensuring broad evaluation.
Significant room for improvement remains in domain-specific VLM performance.
Abstract
We have witnessed promising progress led by large language models (LLMs) and further vision language models (VLMs) in handling various queries as a general-purpose assistant. VLMs, as a bridge to connect the visual world and language corpus, receive both visual content and various text-only user instructions to generate corresponding responses. Though great success has been achieved by VLMs in various fields, in this work, we ask whether the existing VLMs can act as domain experts, accurately answering marine questions, which require significant domain expertise and address special domain challenges/requirements. To comprehensively evaluate the effectiveness and explore the boundary of existing VLMs, we construct the first large-scale marine VLM dataset and benchmark called MarineEval, with 2,000 image-based question-answering pairs. During our dataset construction, we ensure the…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
