Granular Computing-driven SAM: From Coarse-to-Fine Guidance for Prompt-Free Segmentation
Qiyang Yu, Yu Fang, Tianrui Li, Xuemei Cao, Yan Chen, Jianghao Li, Fan Min, Yi Zhang

TL;DR
Grc-SAM introduces a coarse-to-fine, granular computing framework for prompt-free image segmentation, improving localization, detail modeling, and scalability by integrating multi-granularity attention with vision transformers.
Contribution
It presents a novel multi-granularity, coarse-to-fine segmentation framework that automates prompt generation and enhances high-resolution detail modeling.
Findings
Outperforms baseline methods in accuracy.
Demonstrates improved scalability and detail modeling.
Effectively automates prompt generation.
Abstract
Prompt-free image segmentation aims to generate accurate masks without manual guidance. Typical pre-trained models, notably the Segment Anything Model (SAM), generate prompts directly at a single granularity level. However, this approach has two limitations: (1) Localizability: it lacks mechanisms for autonomous region localization; (2) Scalability: its fine-grained modeling is limited at high resolution. To address these challenges, we introduce Granular Computing-driven SAM (Grc-SAM), a coarse-to-fine framework motivated by Granular Computing (GrC). First, the coarse stage adaptively extracts high-response regions from features to achieve precise foreground localization and reduce reliance on external prompts. Second, the fine stage applies finer patch partitioning with sparse local swin-style attention to enhance detail modeling and enable high-resolution segmentation. Third, refined masks…
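The first two stages described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names (`coarse_box`, `window_partition`, `local_attention`), the mean-plus-`k`-standard-deviations threshold, the L2-norm response map, and the projection-free single-head attention are all assumptions chosen to keep the example short. It only shows the general idea of localizing a high-response region first and then attending within small windows inside that region.

```python
import numpy as np

def coarse_box(response, k=1.0):
    """Bounding box (top, bottom, left, right) of the high-response region.

    The threshold mean + k * std is an assumed adaptive rule, not the paper's.
    """
    mask = response > response.mean() + k * response.std()
    if not mask.any():  # nothing stands out: fall back to the whole map
        return 0, response.shape[0], 0, response.shape[1]
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    return int(rows[0]), int(rows[-1]) + 1, int(cols[0]), int(cols[-1]) + 1

def window_partition(x, window):
    """Split an (H, W, C) map into non-overlapping (window, window, C) tiles."""
    h, w, c = x.shape
    # A real pipeline would pad; this sketch assumes divisible sizes.
    assert h % window == 0 and w % window == 0
    x = x.reshape(h // window, window, w // window, window, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, window, window, c)

def local_attention(windows):
    """Single-head self-attention restricted to each window (no learned weights)."""
    n, wh, ww, c = windows.shape
    tokens = windows.reshape(n, wh * ww, c)
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(c)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return (weights @ tokens).reshape(n, wh, ww, c)

# Synthetic 16x16 feature map with 8 channels and a bright 8x8 foreground.
rng = np.random.default_rng(0)
features = rng.normal(size=(16, 16, 8)) * 0.1
features[4:12, 4:12] += 2.0

# Coarse stage: locate the high-response region from per-pixel feature norms.
response = np.linalg.norm(features, axis=-1)
top, bottom, left, right = coarse_box(response)

# Fine stage: finer partitioning and window-local attention inside the region.
crop = features[top:bottom, left:right]
windows = window_partition(crop, window=4)
refined = local_attention(windows)
```

Because attention is computed only inside each window of the cropped region, its cost grows with the number of windows rather than with the square of the full image's token count, which is the usual motivation for Swin-style local attention at high resolution.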
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The authors propose a coarse-to-fine SAM-based model to eliminate manual prompt requirements. 2. Multi-level attention maps are leveraged to localize target regions. 3. A Swin-style transformer is employed for fine-grained mask generation.
1. SAM and SAM 2 lack semantic label prediction, and therefore cannot properly evaluate mIoU or PA on multi-class datasets such as ADE20K and PASCAL VOC 2012. Similar to SAM Automatic Mask Generation (SAM-AMG), they only segment all possible objects without assigning semantic labels. Consequently, the mIoU and PA results reported for SAM and SAM 2 in Tables 1 and 2 are not accurate. 2. I attempted to identify how the proposed method enables semantic label prediction. If my understanding is corre
Below are the strengths of this paper: 1. Conceptual novelty: The paper proposes prompt internalization, moving beyond simple auto-prompt generation toward integrating granularity control directly into SAM. 2. Efficiency improvement: The hierarchical coarse-to-fine structure empirically reduces FLOPs by ~44% and latency by ~7.5×, indicating a tangible computational advantage. 3. Clear motivation: The paper’s intuition, focusing computation where semantic importance is high, is sound and aligns w
1. Comparisons are limited. The evaluation omits stronger prompt-generation baselines (e.g., AoP-SAM [1]), so it is difficult to quantify the claimed advantages. It is also unclear whether integrating prompt generation inside SAM leads to inherently superior modeling compared to external prompt generation. The evidence is partially convincing, but the experiments require comparisons with more recent prompt-free or auto-prompt baselines. 2. Accuracy gain is marginal. Improvements over
1. The topic is interesting and relevant, as it focuses on improving automation and efficiency in segmentation through a new prompt generation approach. 2. The paper is generally well written and organized, making it easy to understand the proposed framework. 3. The comparison tables show that the proposed method performs competitively against baseline models.
1. The paper only provides system-level comparisons, without quantitative ablation studies on the proposed components. For instance, since the framework adopts a multi-stage coarse-to-fine design, a natural comparison would be with a single-stage setup. Similarly, it would be helpful to include an analysis of the effect of using or removing local attention. Overall, each proposed module should be supported by quantitative ablations to clearly demonstrate its effectiveness. 2. Figure 2 seems to b
Taxonomy
Topics: Ferroelectric and Negative Capacitance Devices · Advanced Neural Network Applications · Medical Image Segmentation Techniques
