Open-Vocabulary Object Detectors: Robustness Challenges under Distribution Shifts
Prakash Chandra Chhipa, Kanjar De, Meenakshi Subhash Chippa, Rajkumar Saini, Marcus Liwicki

TL;DR
This paper evaluates the robustness of recent open-vocabulary object detection models under various distribution shifts, revealing significant challenges and guiding future research for more reliable vision systems.
Contribution
It provides a comprehensive robustness assessment of three leading open-vocabulary object detectors across multiple challenging benchmarks.
Findings
Models show decreased performance under distribution shifts.
Robustness varies significantly across different types of shifts.
The results highlight the need for improved robustness in open-vocabulary detection.
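One common way to quantify a performance decrease under distribution shift is the relative drop in average precision (AP) between clean and shifted data. The sketch below is illustrative only: the `relative_drop` helper, the detector names, and all AP values are hypothetical placeholders, and the summary above does not state which metric the paper itself reports.

```python
def relative_drop(clean_ap: float, shifted_ap: float) -> float:
    """Return the fraction of clean-data AP lost under a distribution shift."""
    if clean_ap <= 0:
        raise ValueError("clean_ap must be positive")
    return (clean_ap - shifted_ap) / clean_ap


# Placeholder AP values for illustration only; these are NOT results from the paper.
example_scores = {
    "detector_a": {"clean": 40.0, "shifted": 30.0},
    "detector_b": {"clean": 50.0, "shifted": 45.0},
}

# Relative drop per detector: lower means more robust to the shift.
drops = {
    name: relative_drop(scores["clean"], scores["shifted"])
    for name, scores in example_scores.items()
}

# The detector with the smallest relative drop is the most robust under this measure.
most_robust = min(drops, key=drops.get)

for name, drop in drops.items():
    print(f"{name}: {drop:.1%} relative AP drop")
print("most robust:", most_robust)
```

Normalizing by clean AP makes detectors with different baseline accuracy comparable, which matters when robustness varies across shift types as the findings indicate.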
Abstract
The challenge of Out-Of-Distribution (OOD) robustness remains a critical hurdle towards deploying deep vision models. Vision-Language Models (VLMs) have recently achieved groundbreaking results. VLM-based open-vocabulary object detection extends the capabilities of traditional object detection frameworks, enabling the recognition and classification of objects beyond predefined categories. Investigating OOD robustness in recent open-vocabulary object detection is essential to increase the trustworthiness of these models. This study presents a comprehensive robustness evaluation of the zero-shot capabilities of three recent open-vocabulary (OV) foundation object detection models: OWL-ViT, YOLO World, and Grounding DINO. Experiments carried out on the robustness benchmarks COCO-O, COCO-DC, and COCO-C encompassing distribution shifts due to information loss, corruption, adversarial attacks,…
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques
Methods: Attention Is All You Need · Softmax · Linear Layer · Residual Connection · Multi-Head Attention · Dense Connections · Layer Normalization · Vision Transformer · self-DIstillation with NO labels
