Catch Me If You Can Describe Me: Open-Vocabulary Camouflaged Instance Segmentation with Diffusion

Tuan-Anh Vu; Duc Thanh Nguyen; Qing Guo; Nhat Chung; Binh-Son Hua; Ivor W. Tsang; Sai-Kit Yeung

arXiv:2312.17505·cs.CV·March 5, 2026·2 cites

TL;DR

This paper introduces a novel diffusion-based method for open-vocabulary camouflaged instance segmentation, effectively learning multi-scale textual-visual features to distinguish camouflaged objects from complex backgrounds.

Contribution

It proposes a new approach leveraging diffusion models and cross-domain feature fusion to improve camouflaged object segmentation in open-vocabulary settings.

Findings

01. Outperforms existing methods on benchmark datasets.

02. Effectively segments unseen object classes.

03. Enhances detection of camouflaged objects in complex scenes.

Abstract

Text-to-image diffusion techniques have shown exceptional capabilities in producing high-quality, dense visual predictions from open-vocabulary text. This indicates a strong correlation between visual and textual domains in open concepts and that diffusion-based text-to-image models can capture rich and diverse information for computer vision tasks. However, we found that these advantages do not hold when learning features of camouflaged individuals, because their visual boundaries blend significantly with their surroundings. In this paper, while leveraging the benefits of diffusion-based techniques and text-image models in open-vocabulary settings, we aim to address a challenging problem in computer vision: open-vocabulary camouflaged instance segmentation (OVCIS). Specifically, we propose a method built upon state-of-the-art diffusion empowered by open-vocabulary to…

Taxonomy

Topics: Visual Attention and Saliency Detection · Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications

Methods: Contrastive Language-Image Pre-training · Diffusion