PlankFormer: Robust Plankton Instance Segmentation via MAE-Pretrained Vision Transformers and Pseudo Community Image Generation
Masaharu Miyazaki, Yurie Otake, Koichi Ito, Wataru Makino, Jotaro Urabe, Takafumi Aoki

TL;DR
PlankFormer is a novel framework that combines MAE-pretrained Vision Transformers and pseudo community image generation to improve plankton instance segmentation in complex aquatic images.
Contribution
The paper introduces a synthetic data generation method and an MAE-pretrained Vision Transformer model for robust plankton segmentation, reducing reliance on manual annotations.
Findings
Outperforms Mask R-CNN in high-debris environments
Uses synthetic images to train effectively with less manual annotation
Employs MAE pretraining for better global feature capture
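For readers unfamiliar with MAE (masked autoencoder) pretraining, the core idea can be sketched as follows: an image is split into patches, a large random subset is hidden, and the encoder sees only the remaining patches while a decoder learns to reconstruct the hidden ones. The function names, the 16-pixel patch size, and the 0.75 mask ratio below are illustrative assumptions (0.75 is the default in the original MAE work); the summary does not state the settings used by PlankFormer.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0, "image must tile evenly"
    gh, gw = H // patch, W // patch
    x = image.reshape(gh, patch, gw, patch, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * C)

def random_patch_mask(num_patches, mask_ratio, rng):
    """Return a boolean vector; True marks a patch hidden from the encoder."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.permutation(num_patches)[:num_masked]] = True
    return mask

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))           # stand-in for a microscope image
patches = patchify(image, patch=16)         # assumed patch size
mask = random_patch_mask(len(patches), mask_ratio=0.75, rng=rng)  # assumed ratio
visible = patches[~mask]                    # what the encoder would receive
```

Because most patches are hidden, the model must infer missing content from distant visible regions, which is the usual argument for why this pretraining encourages global feature capture.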
Abstract
Plankton monitoring is essential for assessing aquatic ecosystems but is limited by the labor-intensive nature of manual microscopic analysis. Automating the segmentation of plankton from crowded images is crucial; however, it faces two major challenges: (i) the scarcity of pixel-level annotated datasets and (ii) the difficulty of distinguishing plankton from debris and overlapping individuals using conventional CNN-based methods. To address these issues, we propose PlankFormer, a novel framework for plankton instance segmentation. First, to overcome the data shortage, we introduce a method to generate labeled Pseudo Community Images (PCI) by synthesizing individual plankton images onto diverse backgrounds, including those created by generative models. Second, we propose a segmentation model utilizing a Vision Transformer (ViT) backbone with a Mask2Former decoder. To robustly capture…
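The idea of building labeled Pseudo Community Images by placing individual plankton images onto a background can be illustrated with a minimal compositing sketch. This is not the authors' pipeline: the function `paste_crops`, the random placement rule, and the hard (non-blended) paste are all assumptions made for illustration. It shows only why such synthesis yields pixel-level instance labels for free: each pasted individual's mask is written into a label map at the same location.

```python
import numpy as np

def paste_crops(background, crops, masks, rng):
    """Paste each crop at a random position (hypothetical helper).

    background: (H, W, 3) array; crops: list of (h, w, 3) arrays;
    masks: list of (h, w) boolean arrays marking the organism's pixels.
    Returns the composite image and an instance label map (0 = background).
    """
    image = background.copy()
    labels = np.zeros(background.shape[:2], dtype=np.int32)
    H, W = labels.shape
    for idx, (crop, mask) in enumerate(zip(crops, masks), start=1):
        h, w = mask.shape
        top = int(rng.integers(0, H - h + 1))
        left = int(rng.integers(0, W - w + 1))
        # Slices are views, so these assignments modify image and labels.
        image[top:top + h, left:left + w][mask] = crop[mask]
        # Later individuals overwrite earlier ones, so the label map
        # records only the visible part of each occluded instance.
        labels[top:top + h, left:left + w][mask] = idx
    return image, labels

rng = np.random.default_rng(0)
background = np.zeros((64, 64, 3), dtype=np.uint8)    # stand-in background
crops = [np.full((8, 8, 3), 255, dtype=np.uint8)] * 2  # stand-in individuals
masks = [np.ones((8, 8), dtype=bool)] * 2
image, labels = paste_crops(background, crops, masks, rng)
```

A realistic generator would likely also vary scale and rotation and blend object edges into the background; those steps are omitted here because the abstract does not describe them.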
