On the Potential of Open-Vocabulary Models for Object Detection in Unusual Street Scenes

Sadia Ilyas; Ido Freeman; Matthias Rottmann

arXiv:2408.11221·cs.CV·August 22, 2024

TL;DR

This paper evaluates the effectiveness of open-vocabulary object detection models in identifying unusual and out-of-distribution objects in street scenes, highlighting their potential and current limitations for real-world applications.

Contribution

The study benchmarks four state-of-the-art open-vocabulary object detectors across three datasets, revealing their strengths and shortcomings in challenging street scene scenarios.

Findings

01 Grounding DINO achieves top AP of 48.3% on RoadObstacle21.

02 YOLO-World achieves 21.2% AP on RoadAnomaly21.

03 Open-vocabulary models show promise but require improvements for reliable deployment.
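
The findings above are reported as average precision (AP). This page does not state the exact AP protocol used by the benchmark, so the following is only a minimal sketch of one common variant: single class, single image, a fixed IoU threshold of 0.5, greedy matching of predictions to ground truth in descending score order, and all-point interpolation of the precision-recall curve. The names `iou` and `average_precision`, the box format `(x1, y1, x2, y2)`, and the threshold are assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Tuple

# Assumed box format: (x1, y1, x2, y2) with x2 > x1 and y2 > y1.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def average_precision(
    preds: List[Tuple[float, Box]],  # (confidence score, box)
    gts: List[Box],
    iou_thr: float = 0.5,  # assumed threshold, not from the paper
) -> float:
    """AP for one class at one IoU threshold (all-point interpolation)."""
    if not gts:
        return 0.0

    # Visit predictions from most to least confident.
    ranked = sorted(preds, key=lambda p: p[0], reverse=True)
    matched = [False] * len(gts)
    hits = []  # 1 for a true positive, 0 for a false positive

    for _, box in ranked:
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(gts):
            if matched[j]:
                continue  # each ground-truth box can be matched only once
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_iou, best_j = overlap, j
        if best_j >= 0 and best_iou >= iou_thr:
            matched[best_j] = True
            hits.append(1)
        else:
            hits.append(0)

    # Cumulative precision and recall after each ranked prediction.
    precisions, recalls = [], []
    tp = 0
    for k, hit in enumerate(hits, start=1):
        tp += hit
        precisions.append(tp / k)
        recalls.append(tp / len(gts))

    # Make precision non-increasing from right to left (the PR envelope).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Area under the envelope, summed over recall increments.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


if __name__ == "__main__":
    gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
    preds = [
        (0.9, (0, 0, 10, 10)),    # matches the first ground-truth box
        (0.8, (50, 50, 60, 60)),  # false positive
        (0.7, (20, 20, 30, 30)),  # matches the second ground-truth box
    ]
    print(round(average_precision(preds, gts), 4))
```

Benchmark implementations typically pool predictions over the whole dataset and may average AP over several IoU thresholds, so numbers from this sketch are not directly comparable to the reported results.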

Abstract

Out-of-distribution (OOD) object detection is a critical task focused on detecting objects that originate from a data distribution different from that of the training data. In this study, we investigate to what extent state-of-the-art open-vocabulary object detectors can detect unusual objects in street scenes, which are considered OOD or rare scenarios with respect to common street scene datasets. Specifically, we evaluate their performance on the OoDIS Benchmark, which extends RoadAnomaly21 and RoadObstacle21 from SegmentMeIfYouCan, as well as LostAndFound, which was recently extended to object-level annotations. The objective of our study is to uncover shortcomings of contemporary object detectors in challenging real-world, and particularly open-world, scenarios. Our experiments reveal that open-vocabulary models are promising for OOD object detection scenarios, however far…

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications

Methods: Softmax · Linear Layer · Residual Connection · Multi-Head Attention · Layer Normalization · Attention Is All You Need · Dense Connections · Vision Transformer · self-DIstillation with NO labels