Can We Challenge Open-Vocabulary Object Detectors with Generated Content in Street Scenes?
Annika Mütze, Sadia Ilyas, Christian Dörpelkus, Matthias Rottmann

TL;DR
This paper probes the limitations of open-vocabulary object detectors using synthetic data generated by inpainting with Stable Diffusion, revealing weaknesses and a strong dependence on object location.
Contribution
The study introduces automated pipelines for generating challenging synthetic data to systematically evaluate and identify failure modes of open-vocabulary object detectors.
Findings
Open-vocabulary detectors can overlook objects in synthetic images.
Models show strong dependence on object location over semantics.
Synthetic data reveals systematic failure modes.
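One way to make the findings above concrete is to count how often an inpainted object goes undetected and to break that miss rate down by image region. The following is a minimal sketch under stated assumptions, not the authors' evaluation protocol: the function names, the IoU threshold of 0.5, and the 2x2 grid are all hypothetical choices made for illustration, and boxes are assumed to be `(x1, y1, x2, y2)` tuples in pixels.

```python
from collections import defaultdict

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_overlooked(target, detections, iou_thr=0.5):
    """True if no detection overlaps the inpainted object enough.

    The threshold is an assumption, not a value taken from the paper.
    """
    return all(iou(target, d) < iou_thr for d in detections)

def miss_rate_by_cell(samples, img_w, img_h, grid=(2, 2), iou_thr=0.5):
    """Miss rate per grid cell, keyed by (row, col).

    samples: iterable of (target_box, detection_boxes) pairs, one per image.
    grid: (columns, rows); a hypothetical coarse partition of the image.
    """
    counts = defaultdict(lambda: [0, 0])  # cell -> [misses, total]
    for target, dets in samples:
        # Assign the object to a cell by its box center.
        cx = (target[0] + target[2]) / 2
        cy = (target[1] + target[3]) / 2
        col = min(int(cx / img_w * grid[0]), grid[0] - 1)
        row = min(int(cy / img_h * grid[1]), grid[1] - 1)
        counts[(row, col)][1] += 1
        if is_overlooked(target, dets, iou_thr):
            counts[(row, col)][0] += 1
    return {cell: misses / total for cell, (misses, total) in counts.items()}

# Toy data: one detected object top-left, two missed objects bottom-right.
samples = [
    ((10, 10, 30, 30), [(10, 10, 30, 30)]),
    ((60, 60, 80, 80), []),
    ((70, 70, 90, 90), [(0, 0, 10, 10)]),
]
rates = miss_rate_by_cell(samples, img_w=100, img_h=100)
```

A large gap between cells with comparable object categories would point to location rather than semantics as the driver of misses, which is the kind of systematic failure mode the summary describes.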
Abstract
Open-vocabulary object detectors such as Grounding DINO are trained on vast and diverse data, achieving remarkable performance on challenging datasets. As a result, it is unclear where their limitations lie, which is a major concern when they are used in safety-critical applications. Real-world data does not provide the control required for a rigorous evaluation of model generalization. In contrast, synthetically generated data allows systematic exploration of the boundaries of model competence and generalization. In this work, we address two research questions: 1) Can we challenge open-vocabulary object detectors with generated image content? 2) Can we find systematic failure modes of those models? To address these questions, we design two automated pipelines using Stable Diffusion to inpaint unusual objects with high diversity in semantics, by sampling multiple substantives from…
