SemSegBench & DetecBench: Benchmarking Reliability and Generalization Beyond Classification

Shashank Agnihotri; David Schader; Jonas Jakubassa; Nico Sharei; Simon Kral; Mehmet Ege Kaçar; Ruben Weber; Margret Keuper

arXiv:2505.18015·cs.CV·May 26, 2025

TL;DR

This paper introduces two benchmarking tools, SEMSEGBENCH and DETECBENCH, for evaluating the robustness and generalization of semantic segmentation and object detection models under adversarial attacks and corruptions. The evaluation reveals systematic weaknesses in current models and is intended to guide future improvements.

Contribution

The paper provides the most extensive evaluation to date of the reliability and generalization of segmentation and detection models, together with open-source benchmarking tools and a large collection of evaluation results for future research.

Findings

01 Systematic weaknesses in state-of-the-art models under adversarial attacks.

02 Key trends related to architecture, backbone, and model capacity.

03 Benchmarking results across multiple datasets and models.

Abstract

Reliability and generalization in deep learning are predominantly studied in the context of image classification. Yet, real-world applications in safety-critical domains involve a broader set of semantic tasks, such as semantic segmentation and object detection, which come with a diverse set of dedicated model architectures. To facilitate research towards robust model design in segmentation and detection, our primary objective is to provide benchmarking tools regarding robustness to distribution shifts and adversarial manipulations. We propose the benchmarking tools SEMSEGBENCH and DETECBENCH, along with the most extensive evaluation to date on the reliability and generalization of semantic segmentation and object detection models. In particular, we benchmark 76 segmentation models across four datasets and 61 object detectors across two datasets, evaluating their performance under…
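The kind of comparison the abstract describes, performance on clean inputs versus performance under a distribution shift, can be illustrated with a minimal sketch. This is not code from SEMSEGBENCH or DETECBENCH: `toy_model`, `gaussian_noise`, the flattened toy "image", and the choice of mean intersection-over-union (mIoU) as the metric are all assumptions made for illustration.

```python
import random


def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes that occur in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


def toy_model(pixels):
    """Hypothetical stand-in for a segmentation model: thresholds intensity."""
    return [1 if x > 0.5 else 0 for x in pixels]


def gaussian_noise(pixels, sigma, rng):
    """Illustrative corruption: additive Gaussian noise, clipped to [0, 1]."""
    return [min(1.0, max(0.0, x + rng.gauss(0.0, sigma))) for x in pixels]


rng = random.Random(0)
pixels = [rng.random() for _ in range(1000)]   # flattened toy image
labels = [1 if x > 0.5 else 0 for x in pixels]  # toy ground-truth mask

clean_miou = mean_iou(toy_model(pixels), labels, num_classes=2)
corrupt_miou = mean_iou(
    toy_model(gaussian_noise(pixels, 0.2, rng)), labels, num_classes=2
)
drop = clean_miou - corrupt_miou

print(f"clean mIoU:     {clean_miou:.3f}")
print(f"corrupted mIoU: {corrupt_miou:.3f}")
print(f"absolute drop:  {drop:.3f}")
```

A real benchmark would replace the toy model with trained networks, the toy data with a labeled dataset, and the single noise level with a suite of corruptions and adversarial attacks; the structure of the comparison stays the same.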

Code & Models

Repositories

shashankskagnihotri/benchmarking_reliability_generalization
PyTorch · Official

Taxonomy

Methods: Sparse Evolutionary Training