TL;DR
FA-Seg is a training-free, diffusion-based framework for open-vocabulary segmentation. It targets both accuracy and efficiency by using a minimal (1+1)-step process together with attention refinement techniques.
Contribution
FA-Seg introduces a training-free segmentation method built on a pretrained diffusion model. It combines dual-prompt attention, hierarchical refinement, and test-time flipping to improve open-vocabulary segmentation.
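Test-time flipping is commonly implemented by running the model on an image and on its horizontal mirror, un-mirroring the second output, and averaging the two. The sketch below shows that generic recipe only; it is not taken from the FA-Seg code. The names `predict_with_flip` and `score_fn` are hypothetical, and `score_fn` is assumed to map an `H x W x C` image to `K x H x W` per-class score maps.

```python
import numpy as np

def predict_with_flip(score_fn, image):
    """Average class scores over an image and its horizontal mirror.

    score_fn is a hypothetical callable: (H, W, C) image -> (K, H, W) scores.
    This is a generic test-time flip, not FA-Seg's exact procedure.
    """
    image = np.asarray(image)
    scores = score_fn(image)                # scores for the original view
    flipped_scores = score_fn(image[:, ::-1])  # scores for the mirrored view
    restored = flipped_scores[..., ::-1]    # mirror back along the width axis
    return 0.5 * (scores + restored)        # average the two aligned views

if __name__ == "__main__":
    # Toy "model" with a left-to-right bias: the score equals the column index.
    def biased_score_fn(img):
        h, w = img.shape[:2]
        return np.tile(np.arange(w, dtype=float), (h, 1))[None]

    out = predict_with_flip(biased_score_fn, np.zeros((2, 4, 3)))
    print(out)  # the positional bias is averaged out across columns
```

Averaging the aligned views cancels any left-right positional bias in the score maps, at the cost of a second forward pass.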
Findings
Achieves 43.8% average mIoU on multiple benchmarks.
Operates with only a (1+1)-step process from a pretrained diffusion model.
Maintains high inference efficiency while surpassing prior state-of-the-art performance.
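For readers unfamiliar with the metric, mIoU (mean intersection over union) averages, over classes, the ratio of correctly labeled pixels to the union of predicted and ground-truth pixels for that class. The following is a minimal single-image sketch, assuming integer label maps; `mean_iou` is an illustrative helper, not the paper's evaluation code, and benchmark protocols typically accumulate intersections and unions over a whole dataset before averaging.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean IoU between two integer label maps of the same shape.

    Classes absent from both prediction and ground truth are skipped,
    which is one common convention (an assumption here).
    """
    pred = np.asarray(pred)
    target = np.asarray(target)
    ious = []
    for c in range(num_classes):
        p = pred == c
        t = target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class c appears in neither map
        inter = np.logical_and(p, t).sum()
        ious.append(inter / union)
    return float(np.mean(ious)) if ious else float("nan")

if __name__ == "__main__":
    pred = [[0, 0], [1, 1]]
    target = [[0, 1], [1, 1]]
    # class 0: IoU = 1/2, class 1: IoU = 2/3
    print(mean_iou(pred, target, num_classes=2))
```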
Abstract
Open-vocabulary semantic segmentation (OVSS) aims to segment objects from arbitrary text categories without requiring densely annotated datasets. Although contrastive-learning-based models enable zero-shot segmentation, they often lose fine spatial precision at the pixel level due to global representation bias. In contrast, diffusion-based models naturally encode fine-grained spatial features via attention mechanisms that capture both global context and local details. However, they often struggle to balance computational cost against the quality of the segmentation mask. In this work, we present FA-Seg, a Fast and Accurate training-free framework for open-vocabulary segmentation based on diffusion models. FA-Seg performs segmentation using only a (1+1)-step process from a pretrained diffusion model. Moreover, instead of running multiple times for different classes, FA-Seg performs…
