Exploring Open-Vocabulary Object Recognition in Images using CLIP
Wei Yu Chen, Ying Dai

TL;DR
This paper introduces a simplified, training-free open-vocabulary object recognition framework using CLIP and CNN/MLP encodings, achieving state-of-the-art results on multiple datasets without complex retraining.
Contribution
The paper proposes a novel two-stage open-vocabulary object recognition (OVOR) framework, object segmentation followed by recognition, that eliminates retraining, combines CLIP with CNN/MLP encodings, and reports the highest average AP on COCO, Pascal VOC, and ADE20K.
Findings
Training-free CLIP-based encoding outperforms existing methods.
CNN/MLP-based encoding enhances recognition flexibility.
The framework achieves the highest average AP on COCO, Pascal VOC, and ADE20K.
Abstract
To address the limitations of existing open-vocabulary object recognition methods, specifically high system complexity, substantial training costs, and limited generalization, this paper proposes a novel Open-Vocabulary Object Recognition (OVOR) framework based on a streamlined two-stage strategy: object segmentation followed by recognition. The framework eliminates the need for complex retraining and labor-intensive annotation. After cropping object regions, we generate object-level image embeddings alongside category-level text embeddings using CLIP, which facilitates arbitrary vocabularies. To reduce reliance on CLIP and enhance encoding flexibility, we further introduce a CNN/MLP-based method that extracts convolutional neural network (CNN) feature maps and utilizes a multilayer perceptron (MLP) to align visual features with text embeddings. These embeddings are concatenated and…
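The recognition step described in the abstract, matching object-level image embeddings against category-level text embeddings, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `classify_regions`, the random stand-in embeddings, the 512-dimensional embedding size, and the temperature of 0.01 are all assumptions made for illustration. In a real pipeline the two arrays would come from CLIP's image encoder (applied to cropped object regions) and text encoder (applied to category names).

```python
import numpy as np


def classify_regions(image_emb, text_emb, temperature=0.01):
    """Assign each region embedding to its most similar category embedding.

    image_emb: (num_regions, dim) array, one row per cropped object region.
    text_emb:  (num_categories, dim) array, one row per category name.
    temperature is an assumed softmax scale, not a value from the paper.
    """
    # L2-normalise so the dot product equals cosine similarity.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    sims = img @ txt.T                       # (num_regions, num_categories)
    logits = sims / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    return probs.argmax(axis=1), probs


# Hypothetical stand-in data: random vectors instead of real CLIP outputs.
rng = np.random.default_rng(0)
categories = ["cat", "dog", "bicycle"]   # any vocabulary can be used here
text_emb = rng.normal(size=(len(categories), 512))

# Two fake region embeddings: noisy copies of the "dog" and "cat" text vectors.
image_emb = text_emb[[1, 0]] + 0.1 * rng.normal(size=(2, 512))

labels, probs = classify_regions(image_emb, text_emb)
predicted = [categories[i] for i in labels]
```

Because the category list is only used to produce text embeddings, swapping in a different vocabulary requires no retraining, which is the property the abstract refers to as facilitating arbitrary vocabularies.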
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques
