Exploring Open-Vocabulary Object Recognition in Images using CLIP

Wei Yu Chen; Ying Dai

arXiv:2603.05962 · cs.CV · March 9, 2026

Open Access

TL;DR

This paper introduces a simplified, training-free open-vocabulary object recognition framework using CLIP and CNN/MLP encodings, achieving state-of-the-art results on multiple datasets without complex retraining.

Contribution

The paper proposes a two-stage open-vocabulary object recognition (OVOR) framework, object segmentation followed by recognition, that eliminates retraining, combines CLIP with CNN/MLP encodings, and reports the highest average AP on COCO, Pascal VOC, and ADE20K.

Findings

01 Training-free CLIP-based encoding outperforms existing methods.

02 CNN/MLP-based encoding enhances recognition flexibility.

03 Achieves highest average AP on COCO, Pascal VOC, and ADE20K.

Abstract

To address the limitations of existing open-vocabulary object recognition methods, specifically high system complexity, substantial training costs, and limited generalization, this paper proposes a novel Open-Vocabulary Object Recognition (OVOR) framework based on a streamlined two-stage strategy: object segmentation followed by recognition. The framework eliminates the need for complex retraining and labor-intensive annotation. After cropping object regions, we generate object-level image embeddings alongside category-level text embeddings using CLIP, which facilitates arbitrary vocabularies. To reduce reliance on CLIP and enhance encoding flexibility, we further introduce a CNN/MLP-based method that extracts convolutional neural network (CNN) feature maps and utilizes a multilayer perceptron (MLP) to align visual features with text embeddings. These embeddings are concatenated and…
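
The recognition step the abstract describes, matching an object-level image embedding against category-level text embeddings, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy 3-dimensional vectors, and the `scale` value are all assumptions made for illustration, and a real pipeline would obtain the vectors from an image encoder and a text encoder such as CLIP's. The sketch only shows why the vocabulary can be arbitrary: adding a category means adding one more text embedding, with no retraining.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def recognize(region_embedding, text_embeddings, scale=100.0):
    """Score one cropped-region embedding against every category embedding.

    `scale` is an assumed temperature applied to the cosine similarities
    before the softmax; it is a hypothetical parameter of this sketch.
    """
    img = l2_normalize(region_embedding)
    names = list(text_embeddings)
    logits = []
    for name in names:
        txt = l2_normalize(text_embeddings[name])
        cosine = sum(a * b for a, b in zip(img, txt))
        logits.append(scale * cosine)
    return dict(zip(names, softmax(logits)))


# Toy vectors standing in for real encoder outputs (hypothetical values).
vocabulary = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.1, 0.9, 0.0],
    "bicycle": [0.0, 0.1, 0.9],
}
region = [0.8, 0.2, 0.1]  # embedding of one cropped object region

scores = recognize(region, vocabulary)
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

Because the categories live in a plain dictionary of text embeddings, extending the vocabulary only requires a new entry; the scoring function itself does not change.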

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques