PlankFormer: Robust Plankton Instance Segmentation via MAE-Pretrained Vision Transformers and Pseudo Community Image Generation

Masaharu Miyazaki; Yurie Otake; Koichi Ito; Wataru Makino; Jotaro Urabe; Takafumi Aoki

arXiv:2604.17856·cs.CV·April 21, 2026

TL;DR

PlankFormer is a novel framework that combines MAE-pretrained Vision Transformers and pseudo community image generation to improve plankton instance segmentation in complex aquatic images.

Contribution

The paper introduces a synthetic data generation method and an MAE-pretrained Vision Transformer model for robust plankton segmentation, reducing reliance on manual annotations.

Findings

01 Outperforms Mask R-CNN in high-debris environments

02 Uses synthetic images to train effectively with less manual annotation

03 Employs MAE pretraining for better global feature capture
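
To make the third finding more concrete, the following is a minimal sketch of the random patch masking step that MAE-style pretraining is built on: most image patches are hidden and the encoder sees only the remainder. The function name `random_patch_mask`, the 196-patch grid, and the 0.75 mask ratio are illustrative assumptions (the ratio is the common default from the original MAE work), not settings reported on this page.

```python
import random


def random_patch_mask(num_patches, mask_ratio, rng):
    """Split patch indices into (visible, masked) lists for MAE-style pretraining.

    Hypothetical helper for illustration; not taken from the PlankFormer code.
    """
    num_masked = int(num_patches * mask_ratio)
    shuffled = list(range(num_patches))
    rng.shuffle(shuffled)
    masked = sorted(shuffled[:num_masked])
    visible = sorted(shuffled[num_masked:])
    return visible, masked


# Assumed setup: a 14 x 14 grid of patches with 75% of them masked.
visible, masked = random_patch_mask(num_patches=196, mask_ratio=0.75, rng=random.Random(0))
```

In a full pipeline, only the `visible` patches would be fed to the ViT encoder, and a lightweight decoder would be trained to reconstruct the pixels of the `masked` ones.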

Abstract

Plankton monitoring is essential for assessing aquatic ecosystems but is limited by the labor-intensive nature of manual microscopic analysis. Automating the segmentation of plankton from crowded images is crucial; however, it faces two major challenges: (i) the scarcity of pixel-level annotated datasets and (ii) the difficulty of distinguishing plankton from debris and overlapping individuals using conventional CNN-based methods. To address these issues, we propose PlankFormer, a novel framework for plankton instance segmentation. First, to overcome the data shortage, we introduce a method to generate labeled Pseudo Community Images (PCI) by synthesizing individual plankton images onto diverse backgrounds, including those created by generative models. Second, we propose a segmentation model utilizing a Vision Transformer (ViT) backbone with a Mask2Former decoder. To robustly capture…
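
The PCI idea described in the abstract can be illustrated with a small sketch: individual plankton crops are pasted onto a background, and the instance label map comes for free because the paste positions are known. This is a simplified illustration under stated assumptions, not the authors' pipeline. The function `make_pseudo_community_image`, the grayscale nested-list image format, and the uniform random placement are all hypothetical, and the generative-model backgrounds mentioned in the abstract are not reproduced here.

```python
import random


def make_pseudo_community_image(background, individuals, rng):
    """Paste (pixels, mask) crops at random positions on a copy of the background.

    Returns (image, labels), where labels[y][x] is 0 for background or the
    1-based id of the individual visible at that pixel. Later crops overwrite
    earlier ones, which imitates overlapping individuals.
    Assumes every crop fits inside the background.
    """
    height, width = len(background), len(background[0])
    image = [row[:] for row in background]
    labels = [[0] * width for _ in range(height)]
    for instance_id, (pixels, mask) in enumerate(individuals, start=1):
        crop_h, crop_w = len(mask), len(mask[0])
        top = rng.randint(0, height - crop_h)
        left = rng.randint(0, width - crop_w)
        for dy in range(crop_h):
            for dx in range(crop_w):
                if mask[dy][dx]:  # copy only pixels that belong to the organism
                    image[top + dy][left + dx] = pixels[dy][dx]
                    labels[top + dy][left + dx] = instance_id
    return image, labels


# Toy data: an 8 x 8 empty background and a 2 x 2 crop with a 3-pixel mask.
background = [[0] * 8 for _ in range(8)]
crop = ([[200, 200], [200, 200]], [[1, 1], [1, 0]])
image, labels = make_pseudo_community_image(background, [crop, crop], random.Random(0))
```

Because the label map is produced alongside the image, each synthetic sample is annotated at the pixel level without manual work, which is the property the abstract relies on to address annotation scarcity.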
