TL;DR
This paper introduces S3OD, a synthetic dataset and architecture that significantly enhances the generalization of salient object detection models across various datasets and tasks.
Contribution
The paper presents a large-scale synthetic dataset and an ambiguity-aware model architecture that improve generalization in salient object detection.
Findings
Models trained only on synthetic data achieve a 20-50% error reduction in cross-dataset generalization.
Fine-tuned models achieve state-of-the-art results on DIS and HR-SOD benchmarks.
The multi-modal diffusion pipeline extracts mask labels from diffusion and DINO-v3 features.
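To make the "20-50% error reduction" figure concrete, here is a minimal sketch of how a relative error reduction is typically computed. The summary does not say which error metric is used, so mean absolute error (MAE) is an assumption here, and the numeric values are hypothetical and chosen only to illustrate the arithmetic.

```python
def relative_error_reduction(baseline_error: float, new_error: float) -> float:
    """Return the fractional drop of new_error relative to baseline_error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


# Hypothetical MAE values (not taken from the paper).
baseline_mae = 0.040   # assumed error of a model trained on real data
synthetic_mae = 0.028  # assumed error of a model trained on synthetic data

reduction = relative_error_reduction(baseline_mae, synthetic_mae)
print(f"Relative error reduction: {reduction:.0%}")
```

With these illustrative numbers the reduction is 30%, which would fall inside the reported 20-50% range.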
Abstract
Salient object detection exemplifies data-bounded tasks where expensive pixel-precise annotations force separate model training for related subtasks like DIS and HR-SOD. We present a method that dramatically improves generalization through large-scale synthetic data generation and ambiguity-aware architecture. We introduce S3OD, a dataset of over 139,000 high-resolution images created through our multi-modal diffusion pipeline that extracts labels from diffusion and DINO-v3 features. The iterative generation framework prioritizes challenging categories based on model performance. We propose a streamlined multi-mask decoder that handles the inherent ambiguity in salient object detection by predicting multiple valid interpretations. Models trained only on synthetic data achieve 20-50% error reduction in cross-dataset generalization, while fine-tuned versions reach state-of-the-art…
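One common way to train a decoder that predicts multiple valid interpretations is a winner-takes-all objective: compute a loss for each predicted mask and back-propagate only the smallest one. The abstract does not state which objective S3OD uses, so the sketch below is an assumption for illustration, not the authors' method. The function names and the toy masks are hypothetical.

```python
import numpy as np


def bce(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    """Mean binary cross-entropy between a probability map and a binary mask."""
    p = np.clip(pred, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(p) + (1.0 - target) * np.log(1.0 - p)))


def min_over_masks_loss(preds: np.ndarray, target: np.ndarray) -> tuple[int, float]:
    """Pick the hypothesis that best matches the annotation.

    preds:  (K, H, W) array of per-pixel probabilities, one map per hypothesis.
    target: (H, W) binary ground-truth mask.
    Returns the index of the best hypothesis and its loss.
    """
    losses = [bce(p, target) for p in preds]
    best = int(np.argmin(losses))
    return best, losses[best]


# Toy 4x4 example: the annotator marked the top half as salient.
target = np.zeros((4, 4))
target[:2, :] = 1.0

hyp_top = np.full((4, 4), 0.1)
hyp_top[:2, :] = 0.9   # agrees with the annotation

hyp_left = np.full((4, 4), 0.1)
hyp_left[:, :2] = 0.9  # a different, possibly also valid, interpretation

best_index, best_loss = min_over_masks_loss(np.stack([hyp_left, hyp_top]), target)
print(best_index, round(best_loss, 3))
```

Only the best-matching hypothesis is penalized, so the other outputs remain free to represent alternative interpretations of what is salient.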
Peer Reviews
Decision: ICLR 2026 Poster
**1) Scale and Diversity of Synthetic Data (Figure 1, Table 1, Figure 4, Figure 6):** S3OD delivers an order-of-magnitude increase in dataset scale for SOD, with 139k+ images spanning 1676 unique objects and a wide spectrum of scene types, lighting, and occlusions, as seen qualitatively in Figures 1, 4, and 6 and quantitatively in Table 1. Manually verified mask quality and data curation strategies, including filtering with VLMs, result in a synthetic dataset that rivals or exceeds real sets in
**Potential for Domain Overfitting or Synthetic-“Leakage” Not Fully Addressed:** Although the cross-dataset generalization is well documented, concerns about overfitting to synthetic artifacts (such as those possibly present in highly artificial or LLM-generated prompts) are only partly mitigated by filtering and photo-realism tuning (see Figure 7 and Section B). There is no explicit domain gap or bias quantification (such as t-SNE/UMAP distributions, or model calibration metrics) to back up
- Multi-Modal Dataset Diffusion Pipeline that fuses diffusion feature maps, concept attention maps, and DINO-v3 representations to jointly generate images and masks, ensuring strong image–label alignment and enabling a 139k+ high-resolution synthetic set that boosts generalization.
- Ambiguity-aware architecture with a streamlined multi-mask decoder that explicitly models multiple valid interpretations.
- Iterative generation framework that is feedback-driven to prioritize challenging categori
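The feedback-driven prioritization mentioned in this review can be sketched as an allocation rule: categories where the current model performs worse receive a larger share of the next generation round. The excerpt does not give the actual rule, so proportional allocation, the function name, the category names, and the error values below are all assumptions for illustration.

```python
def generation_budget(category_errors: dict[str, float], total_images: int) -> dict[str, int]:
    """Split total_images across categories in proportion to their error.

    category_errors maps a category name to the model's current error on it
    (higher means harder). Rounding means the shares may not sum exactly
    to total_images.
    """
    total_error = sum(category_errors.values())
    if total_error <= 0:
        raise ValueError("at least one category must have positive error")
    return {
        name: round(total_images * err / total_error)
        for name, err in category_errors.items()
    }


# Hypothetical per-category errors from an evaluation round.
errors = {
    "bicycle": 0.10,
    "glass vase": 0.30,
    "dog": 0.05,
    "chain-link fence": 0.55,
}

budget = generation_budget(errors, total_images=1000)
for name, count in sorted(budget.items(), key=lambda item: -item[1]):
    print(f"{name}: {count}")
```

Repeating this after each training round would shift generation toward whichever categories remain hardest for the model.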
- The author should clarify whether mask extraction in the Multi-Modal Dataset Diffusion pipeline requires any training or calibration. If not, provide rigorous evidence of mask fidelity.
- The proposed data-generation paradigm appears tailored to binary/saliency segmentation; please evaluate transfer to camouflaged object detection (COD) or non-salient classes and report the zero-shot and fine-tuned results.
- The author should strengthen the annotation rationale with interpretable visualizat
+ The authors introduced S3OD dataset with 139,000+ samples.
+ Models trained only on the S3OD data show good performance.
- The authors use S3OD as the name for both method and dataset. This causes confusion in the paper reading.
- The novelty of the proposed S3OD method is incremental. All components of S3OD exist in literature.
- The new S3OD dataset is AI-generated. However, all existing SOD datasets used ground truth from humans. The ground truth of S3OD dataset should come from humans.
- The experiments are unfair. All baselines were not trained on the S3OD dataset. More than 139,000+ samples of S3OD should
