Towards Real Zero-Shot Camouflaged Object Segmentation without Camouflaged Annotations
Cheng Lei, Jie Fan, Xinran Li, Tianzhu Xiang, Ao Li, Ce Zhu, Le Zhang

TL;DR
This paper introduces a zero-shot camouflaged object segmentation framework that leverages semantic features transferred from salient object segmentation together with multimodal models, achieving state-of-the-art zero-shot results without requiring any manual annotations of camouflaged objects.
Contribution
The proposed framework is the first to enable zero-shot COS using semantic transfer, multimodal alignment, and a learnable codebook for efficient inference.
Findings
Achieves $F_{\beta}^w$ scores of 72.9% on CAMO and 71.7% on COD10K.
Runs at 18.1 FPS at inference time, without invoking the large language model.
Outperforms existing methods in zero-shot camouflaged object segmentation.
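For orientation, the weighted F-measure scores listed above build on the ordinary F-measure, which combines precision and recall as $(1+\beta^2) \cdot P \cdot R / (\beta^2 \cdot P + R)$. The sketch below computes only this plain, unweighted form for two binary masks. The weighted variant replaces precision and recall with error-weighted versions and is not reproduced here. The helper name `f_beta`, the default `beta_sq=1.0`, and the toy masks are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def f_beta(pred, gt, beta_sq=1.0, eps=1e-8):
    """Plain (unweighted) F-beta between two binary masks.

    Hypothetical helper for illustration; not the paper's evaluation code.
    """
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    # True positives: pixels marked foreground in both masks.
    tp = np.logical_and(pred, gt).sum()
    # eps guards against division by zero for empty masks.
    precision = tp / (pred.sum() + eps)
    recall = tp / (gt.sum() + eps)
    return float((1 + beta_sq) * precision * recall
                 / (beta_sq * precision + recall + eps))

# Toy 2x3 masks (made-up data): 2 overlapping pixels, 3 foreground pixels each.
pred = np.array([[1, 1, 0], [0, 1, 0]])
gt = np.array([[1, 0, 0], [0, 1, 1]])
score = f_beta(pred, gt)
```

With these toy masks, precision and recall are both 2/3, so the score is approximately 2/3 as well. An empty prediction yields a score of 0 rather than raising an error.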
Abstract
Camouflaged Object Segmentation (COS) faces significant challenges due to the scarcity of annotated data, where meticulous pixel-level annotation is both labor-intensive and costly, primarily due to the intricate object-background boundaries. Addressing the core question, "Can COS be effectively achieved in a zero-shot manner without manual annotations for any camouflaged object?" we affirmatively respond and introduce a robust zero-shot COS framework. This framework leverages the inherent local pattern bias of COS and employs a broad semantic feature space derived from salient object segmentation (SOS) for efficient zero-shot transfer. We incorporate a Masked Image Modeling (MIM) based image encoder optimized for Parameter-Efficient Fine-Tuning (PEFT), a Multimodal Large Language Model (M-LLM), and a Multi-scale Fine-grained Alignment (MFA) mechanism. The MIM pre-trained image encoder…
Taxonomy
Topics: Visual Attention and Saliency Detection · Image Enhancement Techniques · Advanced Image and Video Retrieval Techniques
Methods: Mutual Information Machine/Mask Image Modeling · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
