Will It Zero-Shot?: Predicting Zero-Shot Classification Performance For Arbitrary Queries

Kevin Robbins; Xiaotong Liu; Yu Wu; Le Sun; Grady McPeak; Abby Stylianou; Robert Pless

arXiv:2601.17535 · cs.CV · March 26, 2026


TL;DR

This paper proposes an image-based method to predict the zero-shot classification performance of vision-language models like CLIP, enabling users to assess model effectiveness for specific tasks without labeled data.

Contribution

It introduces a novel approach that uses synthetic images to improve zero-shot performance prediction, building on text-only evaluation methods.

Findings

1. Generated images substantially improve prediction accuracy.

2. The approach helps users assess model suitability without labeled data.

3. Experiments confirm effectiveness on standard benchmarks.

Abstract

Vision-Language Models like CLIP create aligned embedding spaces for text and images, making it possible for anyone to build a visual classifier by simply naming the classes they want to distinguish. However, a model that works well in one domain may fail in another, and non-expert users have no straightforward way to assess whether their chosen VLM will work on their problem. We build on prior work using text-only comparisons to evaluate how well a model works for a given natural language task, and explore approaches that also generate synthetic images relevant to that task to evaluate and refine the prediction of zero-shot accuracy. We show that adding generated imagery to the baseline text-only scores substantially improves the quality of these predictions. Additionally, it gives the user feedback on the kinds of images that were used to make the assessment. Experiments on standard CLIP…
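To make the idea concrete, here is a minimal sketch, under stated assumptions, of how generated images could yield an accuracy estimate. It assumes embeddings have already been produced by some CLIP-style encoder (not shown). It then uses a hypothetical proxy: the fraction of generated images that the zero-shot classifier assigns to the class they were generated for. The function names and the toy 2-D vectors are illustrative only. This is not necessarily how the authors score images or combine them with the text-only baseline.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_predict(image_emb: Vector, class_text_embs: Dict[str, Vector]) -> str:
    """Zero-shot classification: pick the class name whose text embedding
    is most similar to the image embedding."""
    return max(class_text_embs, key=lambda name: cosine(image_emb, class_text_embs[name]))


def estimate_accuracy_from_synthetic(
    synthetic_image_embs: Dict[str, List[Vector]],
    class_text_embs: Dict[str, Vector],
) -> float:
    """Hypothetical proxy for zero-shot accuracy: the fraction of generated
    images that the classifier assigns to the class they were generated for."""
    correct = 0
    total = 0
    for intended_class, embs in synthetic_image_embs.items():
        for emb in embs:
            if zero_shot_predict(emb, class_text_embs) == intended_class:
                correct += 1
            total += 1
    if total == 0:
        raise ValueError("no synthetic images supplied")
    return correct / total


if __name__ == "__main__":
    # Toy 2-D "embeddings" standing in for real encoder outputs.
    text_embs = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
    generated = {
        "cat": [[0.9, 0.1], [0.4, 0.6]],  # second "cat" image sits closer to "dog"
        "dog": [[0.2, 0.8]],
    }
    print(round(estimate_accuracy_from_synthetic(generated, text_embs), 3))  # → 0.667
```

Because the per-image predictions are computed explicitly, the misclassified generated images could also be shown to the user. This loosely mirrors the abstract's point that the generated imagery gives feedback on what the assessment was based on.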


Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Topic Modeling