S3OD: Towards Generalizable Salient Object Detection with Synthetic Data

Orest Kupyn; Hirokatsu Kataoka; Christian Rupprecht

arXiv:2510.21605·cs.CV·March 3, 2026

3 Reviews

TL;DR

This paper introduces S3OD, a synthetic dataset and architecture that significantly enhances the generalization of salient object detection models across various datasets and tasks.

Contribution

The paper presents a large-scale synthetic dataset and an ambiguity-aware model architecture that improve generalization in salient object detection.

Findings

1. Models trained only on synthetic data show a 20-50% error reduction in cross-dataset tests.
2. Fine-tuned models achieve state-of-the-art results on the DIS and HR-SOD benchmarks.
3. The multi-modal diffusion pipeline extracts labels from diffusion and DINO-v3 features.
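To make the "20-50% error reduction" figure concrete, here is a minimal sketch of how a relative error reduction can be computed from a per-pixel error such as mean absolute error (MAE). The page does not state which metric the percentage refers to, so the choice of MAE, the function names, and the sample numbers are all illustrative assumptions rather than the paper's evaluation code.

```python
def mean_absolute_error(pred, target):
    """MAE between two equally sized flat lists of pixel values in [0, 1]."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def relative_error_reduction(baseline_error, new_error):
    """Fraction by which new_error is lower than baseline_error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


# Hypothetical numbers: a baseline MAE of 0.10 and a new MAE of 0.06
# correspond to a relative reduction of about 40%.
reduction = relative_error_reduction(0.10, 0.06)
print(f"{reduction:.0%}")
```

Under this reading, a 20-50% reduction means the new model's error is between 0.8 and 0.5 times the baseline error on the held-out dataset.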

Abstract

Salient object detection exemplifies data-bounded tasks where expensive pixel-precise annotations force separate model training for related subtasks like DIS and HR-SOD. We present a method that dramatically improves generalization through large-scale synthetic data generation and ambiguity-aware architecture. We introduce S3OD, a dataset of over 139,000 high-resolution images created through our multi-modal diffusion pipeline that extracts labels from diffusion and DINO-v3 features. The iterative generation framework prioritizes challenging categories based on model performance. We propose a streamlined multi-mask decoder that handles the inherent ambiguity in salient object detection by predicting multiple valid interpretations. Models trained only on synthetic data achieve 20-50% error reduction in cross-dataset generalization, while fine-tuned versions reach state-of-the-art…
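The abstract says the decoder predicts multiple valid interpretations but, as truncated here, does not say how such a decoder is trained. One common way to train multi-hypothesis predictors is a best-of-K (winner-takes-all) loss, where only the candidate mask closest to the annotation is penalized. The sketch below illustrates that general idea in plain Python; it is an assumption for illustration, not the authors' loss, and the names `binary_cross_entropy` and `best_of_k_loss` are hypothetical.

```python
import math


def binary_cross_entropy(pred, target, eps=1e-7):
    """Mean per-pixel BCE between a predicted mask and a binary target.

    Both arguments are flat lists of the same length; predictions are
    probabilities in [0, 1] and are clamped away from 0 and 1.
    """
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(pred)


def best_of_k_loss(candidate_masks, target):
    """Return (index, loss) of the candidate mask that best matches target.

    Assumption: only the best hypothesis is penalized, so the other
    candidates remain free to represent alternative interpretations.
    """
    losses = [binary_cross_entropy(mask, target) for mask in candidate_masks]
    best_index = min(range(len(losses)), key=losses.__getitem__)
    return best_index, losses[best_index]


# Two hypothetical 4-pixel candidates: the second matches the annotation.
candidates = [
    [0.9, 0.1, 0.1, 0.1],
    [0.1, 0.9, 0.9, 0.1],
]
target = [0.0, 1.0, 1.0, 0.0]
index, loss = best_of_k_loss(candidates, target)
print(index, round(loss, 4))
```

The design choice here is that an image with two plausible salient objects does not force the model to average them into one blurry mask, since each candidate can commit to a different interpretation.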

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 5

Strengths

**1) Scale and Diversity of Synthetic Data (Figure 1, Table 1, Figure 4, Figure 6):** S3OD delivers an order-of-magnitude increase in dataset scale for SOD, with 139k+ images spanning 1676 unique objects and a wide spectrum of scene types, lighting, and occlusions, as seen qualitatively in Figures 1, 4, and 6 and quantitatively in Table 1. Manually verified mask quality and data curation strategies, including filtering with VLMs, result in a synthetic dataset that rivals or exceeds real sets in

Weaknesses

**Potential for Domain Overfitting or Synthetic-“Leakage” Not Fully Addressed:** 1. Although the cross-dataset generalization is well documented, concerns about overfitting to synthetic artifacts (such as those possibly present in highly artificial or LLM-generated prompts) are only partly mitigated by filtering and photo-realism tuning (see Figure 7 and Section B). There is no explicit domain gap or bias quantification (such as t-SNE/UMAP distributions, or model calibration metrics) to back up

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- Multi-Modal Dataset Diffusion Pipeline that fuses diffusion feature maps, concept attention maps, and DINO-v3 representations to jointly generate images and masks, ensuring strong image–label alignment and enabling a 139k+ high-resolution synthetic set that boosts generalization.
- Ambiguity-aware architecture with a streamlined multi-mask decoder that explicitly models multiple valid interpretations.
- Iterative generation framework that is feedback-driven to prioritize challenging categori

Weaknesses

- The author should clarify whether mask extraction in the Multi-Modal Dataset Diffusion pipeline requires any training or calibration. If not, provide rigorous evidence of mask fidelity.
- The proposed data-generation paradigm appears tailored to binary/saliency segmentation; please evaluate transfer to camouflaged object detection (COD) or non-salient classes and report the zero-shot and fine-tuned results.
- The author should strengthen the annotation rationale with interpretable visualizat

Reviewer 03 · Rating 2 · Confidence 5

Strengths

+ The authors introduced S3OD dataset with 139,000+ samples.
+ Models trained only on the S3OD data show good performance.

Weaknesses

- The authors use S3OD as the name for both method and dataset. This causes confusion in the paper reading.
- The novelty of the proposed S3OD method is incremental. All components of S3OD exist in literature.
- The new S3OD dataset is AI-generated. However, all existing SOD datasets used ground truth from humans. The ground truth of S3OD dataset should come from humans.
- The experiments are unfair. All baselines were not trained on the S3OD dataset. More than 139,000+ samples of S3OD should
