Towards Real Zero-Shot Camouflaged Object Segmentation without Camouflaged Annotations

Cheng Lei; Jie Fan; Xinran Li; Tianzhu Xiang; Ao Li; Ce Zhu; Le Zhang

arXiv:2410.16953·cs.CV·March 3, 2026

TL;DR

This paper introduces a zero-shot camouflaged object segmentation framework that leverages semantic features and multimodal models, achieving state-of-the-art zero-shot results without requiring manual annotations of camouflaged objects.

Contribution

The proposed framework is the first to enable zero-shot COS using semantic transfer, multimodal alignment, and a learnable codebook for efficient inference.

Findings

01

Achieves $F_{\beta}^w$ scores of 72.9% on CAMO and 71.7% on COD10K.

02

Runs at 18.1 FPS without the large language model during inference.

03

Outperforms existing methods in zero-shot camouflaged object segmentation.
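
For readers unfamiliar with the metric in Finding 01, the weighted F-measure is usually written in the form below. This is the standard definition from the segmentation literature, not a formula taken from this summary. The value of $\beta$ is not stated here; $\beta^2 = 1$ is a common choice and is only an assumption.

$$
F_{\beta}^w = \frac{(1 + \beta^2)\, P^w \cdot R^w}{\beta^2\, P^w + R^w}
$$

Here $P^w$ and $R^w$ denote weighted precision and weighted recall, where per-pixel errors are weighted by their location and neighborhood rather than counted equally.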

Abstract

Camouflaged Object Segmentation (COS) faces significant challenges due to the scarcity of annotated data, where meticulous pixel-level annotation is both labor-intensive and costly, primarily due to the intricate object-background boundaries. Addressing the core question, "Can COS be effectively achieved in a zero-shot manner without manual annotations for any camouflaged object?", we respond affirmatively and introduce a robust zero-shot COS framework. This framework leverages the inherent local pattern bias of COS and employs a broad semantic feature space derived from salient object segmentation (SOS) for efficient zero-shot transfer. We incorporate a Masked Image Modeling (MIM) based image encoder optimized for Parameter-Efficient Fine-Tuning (PEFT), a Multimodal Large Language Model (M-LLM), and a Multi-scale Fine-grained Alignment (MFA) mechanism. The MIM pre-trained image encoder…

Taxonomy

Topics: Visual Attention and Saliency Detection · Image Enhancement Techniques · Advanced Image and Video Retrieval Techniques

Methods: Mutual Information Machine/Mask Image Modeling · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings