TL;DR
WOW-Seg is an open-world image segmentation model that uses visual prompts and a new region recognition dataset to segment and recognize objects from open-set categories.
Contribution
The paper introduces WOW-Seg, a word-free open world segmentation model with a novel visual prompt module and a new large-scale region recognition dataset, improving semantic understanding in open-set scenarios.
Findings
Achieves 89.7 semantic similarity and 82.4 semantic IoU on the LVIS dataset.
Surpasses the previous state of the art with only one-eighth of the parameters.
Constructs RR-7K, a large-scale region recognition dataset with 7,662 classes.
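This summary does not define the two reported metrics. As a rough illustration only, the sketch below shows one way a word-overlap IoU between a predicted label and a reference label can be computed. The function name `word_iou`, the whitespace tokenization, and the lowercasing are assumptions for this example, not the paper's evaluation code.

```python
def word_iou(predicted: str, reference: str) -> float:
    """Intersection-over-union of the word sets of two label strings.

    Assumption: labels are compared as lowercase, whitespace-split word sets.
    """
    pred_words = set(predicted.lower().split())
    ref_words = set(reference.lower().split())
    union = pred_words | ref_words
    if not union:
        # Both labels are empty: define the score as 0.0 to avoid dividing by zero.
        return 0.0
    return len(pred_words & ref_words) / len(union)


# One shared word ("traffic") out of three distinct words overall.
partial = word_iou("traffic light", "traffic signal")
# Identical labels overlap completely.
exact = word_iou("Dog", "dog")
```

Under these assumptions, a score reported on a 0 to 100 scale would be such a ratio averaged over all evaluated regions and multiplied by 100.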
Abstract
Open world image segmentation aims to achieve precise segmentation and semantic understanding of targets within images by addressing the infinitely open set of object categories encountered in the real world. However, traditional closed-set segmentation approaches struggle to adapt to complex open world scenarios, while foundation segmentation models such as SAM exhibit notable discrepancies between their strong segmentation capabilities and relatively weaker semantic understanding. To bridge these discrepancies, we propose WOW-Seg, a Word-free Open World Segmentation model for segmenting and recognizing objects from open-set categories. Specifically, WOW-Seg introduces a novel visual prompt module, Mask2Token, which transforms image masks into visual tokens and ensures their alignment with the VLLM feature space. Moreover, we introduce the Cascade Attention Mask to decouple information…
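The abstract says Mask2Token turns image masks into visual tokens but, as excerpted here, does not say how. A minimal sketch of one generic way to do this, masked average pooling over a feature map, is given below. The function name `mask_to_token`, the nested-list feature layout (height × width × channels), and the pooling choice are all assumptions for illustration and are not confirmed as the paper's design.

```python
from typing import List


def mask_to_token(
    features: List[List[List[float]]], mask: List[List[int]]
) -> List[float]:
    """Average the feature vectors at pixels where the binary mask is 1.

    Hypothetical helper: generic masked average pooling, not the actual
    Mask2Token module described in the paper.
    """
    channels = len(features[0][0])
    total = [0.0] * channels
    count = 0
    for feature_row, mask_row in zip(features, mask):
        for vector, inside in zip(feature_row, mask_row):
            if inside:
                count += 1
                for c in range(channels):
                    total[c] += vector[c]
    if count == 0:
        # Empty mask: return an all-zero token instead of dividing by zero.
        return total
    return [value / count for value in total]


# A 2x2 feature map with 2 channels per pixel.
features = [
    [[1.0, 0.0], [3.0, 2.0]],
    [[5.0, 4.0], [7.0, 6.0]],
]
# The mask selects the top-left and bottom-right pixels.
mask = [
    [1, 0],
    [0, 1],
]
token = mask_to_token(features, mask)
```

In a real system the pooled vector would still need a learned projection to align it with the language model's feature space, which this sketch leaves out.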
