How to Evaluate the Generalization of Detection? A Benchmark for Comprehensive Open-Vocabulary Detection
Yiyang Yao, Peng Liu, Tiancheng Zhao, Qianqian Zhang, Jiajia Liao, Chunxin Fang, Kyusong Lee, Qing Wang

TL;DR
This paper introduces OVDEval, a comprehensive benchmark with 9 sub-tasks and a new evaluation metric NMS-AP to better assess the generalization and understanding capabilities of open-vocabulary object detection models.
Contribution
The paper presents a new benchmark dataset with fine-grained tasks and a novel evaluation metric addressing limitations of existing methods in open-vocabulary detection.
Findings
Existing top OVD models struggle on new fine-grained tasks.
NMS-AP provides more accurate evaluation than traditional AP.
Benchmark reveals weaknesses in current OVD models.
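This summary does not spell out how NMS-AP is computed. One plausible reading of the name is that class-agnostic non-maximum suppression (NMS) is applied to a model's predictions before AP is calculated, so that several overlapping boxes on the same object with different labels cannot all count. The sketch below illustrates only that suppression step under this assumption. The function names, the prediction format, the example labels, and the 0.5 IoU threshold are hypothetical and not taken from the paper.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def class_agnostic_nms(preds, iou_thresh=0.5):
    """Greedy NMS that ignores labels (assumed pre-processing before AP).

    preds: list of dicts with keys "box", "score", "label".
    iou_thresh: hypothetical threshold; the paper's value is not given here.
    """
    kept = []
    # Visit predictions from highest to lowest confidence.
    for p in sorted(preds, key=lambda d: d["score"], reverse=True):
        # Keep a box only if it does not heavily overlap any kept box,
        # regardless of which label either box carries.
        if all(iou(p["box"], k["box"]) <= iou_thresh for k in kept):
            kept.append(p)
    return kept


# Hypothetical predictions: two labels on the same box, plus one separate box.
preds = [
    {"box": (10, 10, 50, 50), "score": 0.9, "label": "red car"},
    {"box": (10, 10, 50, 50), "score": 0.8, "label": "blue car"},
    {"box": (100, 100, 140, 140), "score": 0.7, "label": "red car"},
]
kept = class_agnostic_nms(preds)
```

In this toy input the lower-scoring "blue car" box is dropped because it coincides with the higher-scoring "red car" box. A standard AP computation would then run on `kept` rather than on the raw predictions.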
Abstract
Object detection (OD) in computer vision has made significant progress in recent years, transitioning from closed-set labels to open-vocabulary detection (OVD) based on large-scale vision-language pre-training (VLP). However, current evaluation methods and datasets are limited to testing generalization over object types and referral expressions, which do not provide a systematic, fine-grained, and accurate benchmark of OVD models' abilities. In this paper, we propose a new benchmark named OVDEval, which includes 9 sub-tasks and introduces evaluations on commonsense knowledge, attribute understanding, position understanding, object relation comprehension, and more. The dataset is meticulously created to provide hard negatives that challenge models' true understanding of visual and linguistic input. Additionally, we identify a problem with the popular Average Precision (AP) metric when…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
