A Training-Free Framework for Open-Vocabulary Image Segmentation and Recognition with EfficientNet and CLIP

Ying Dai; Wei Yu Chen

arXiv:2510.19333·cs.CV·October 28, 2025

TL;DR

This paper introduces a training-free, two-stage framework that combines EfficientNet and CLIP for open-vocabulary image segmentation and recognition, and reports state-of-the-art results.

Contribution

The novel framework integrates EfficientNet and CLIP with unsupervised segmentation and cross-modal alignment, enabling open-vocabulary recognition without training.
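
To make the cross-modal alignment step concrete, here is a minimal sketch of how segment embeddings might be matched to vocabulary embeddings. It assumes cosine similarity followed by an argmax, which is a common choice for CLIP-style models but is not confirmed by this page. The names `l2_normalize` and `label_segments` are hypothetical, and the embeddings are synthetic stand-ins rather than real CLIP outputs.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so a dot product equals cosine similarity."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def label_segments(segment_embeddings, text_embeddings, vocabulary):
    """Assign each segment the vocabulary entry with the highest cosine similarity.

    Assumption: argmax over cosine similarity; the paper's exact rule may differ.
    """
    sims = l2_normalize(segment_embeddings) @ l2_normalize(text_embeddings).T
    best = sims.argmax(axis=1)
    return [vocabulary[i] for i in best], sims


# Synthetic stand-ins: three "text" embeddings and two noisy "segment" embeddings.
rng = np.random.default_rng(0)
vocabulary = ["cat", "dog", "car"]
text_embeddings = np.eye(3, 8)
segment_embeddings = text_embeddings[[2, 0]] + 0.05 * rng.normal(size=(2, 8))

names, sims = label_segments(segment_embeddings, text_embeddings, vocabulary)
```

Because the vocabulary is just a list of strings encoded at inference time, changing the label set requires no retraining, which is what makes the recognition step open-vocabulary.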

Findings

01 Achieves state-of-the-art performance on COCO, ADE20K, and PASCAL VOC.

02 Segments images effectively without supervision, using SVD and hierarchical clustering.

03 Demonstrates flexibility and generalizability across benchmarks.

Abstract

This paper presents a novel training-free framework for open-vocabulary image segmentation and object recognition (OVSR), which leverages EfficientNetB0, a convolutional neural network, for unsupervised segmentation and CLIP, a vision-language model, for open-vocabulary object recognition. The proposed framework adopts a two-stage pipeline: unsupervised image segmentation followed by segment-level recognition via vision-language alignment. In the first stage, pixel-wise features extracted from EfficientNetB0 are decomposed using singular value decomposition to obtain latent representations, which are then grouped by hierarchical clustering into semantically meaningful regions. The number of clusters is adaptively determined by the distribution of singular values. In the second stage, the segmented regions are localized and encoded into image embeddings using the Vision…
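
The first stage described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the feature map is a synthetic stand-in for EfficientNetB0 activations, the cluster count comes from a hypothetical cumulative-energy threshold on the singular values (the abstract says only that the count depends on their distribution), and Ward linkage is an assumed choice of hierarchical clustering. The function names and the `energy` and `max_clusters` parameters are invented for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def choose_num_clusters(singular_values, energy=0.9, max_clusters=8):
    """Hypothetical rule: smallest k whose leading singular values hold
    `energy` of the squared-spectrum mass, clipped to [2, max_clusters]."""
    power = singular_values ** 2
    cumulative = np.cumsum(power) / power.sum()
    k = int(np.searchsorted(cumulative, energy)) + 1
    return int(np.clip(k, 2, max_clusters))


def segment_features(features, energy=0.9, max_clusters=8):
    """Cluster an (H, W, C) pixel-wise feature map into an (H, W) label map."""
    h, w, c = features.shape
    x = features.reshape(h * w, c).astype(np.float64)
    x = x - x.mean(axis=0, keepdims=True)  # center before the SVD

    # Thin SVD: rows of u scaled by s are the latent pixel representations.
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    k = choose_num_clusters(s, energy, max_clusters)
    latent = u[:, :k] * s[:k]

    # Agglomerative clustering (Ward linkage is an assumption), cut into at most k groups.
    tree = linkage(latent, method="ward")
    flat = fcluster(tree, t=k, criterion="maxclust")
    return flat.reshape(h, w), k


# Synthetic stand-in for a feature map: two regions with different mean activation.
rng = np.random.default_rng(0)
features = rng.normal(0.0, 0.1, size=(8, 8, 16))
features[:, 4:, :] += 5.0

labels, k = segment_features(features)
```

On real feature maps the pixel count is much larger, and Ward linkage on all pixels scales quadratically in memory, so a practical version would likely operate on a downsampled feature grid.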

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Handwritten Text Recognition Techniques