Beyond Accuracy: Metrics that Uncover What Makes a 'Good' Visual Descriptor
Ethan Lin, Linxi Zhao, Atharva Sehgal, Jennifer J. Sun

TL;DR
This paper introduces metrics for evaluating the quality of visual descriptors, focusing on their representational capacity and their alignment with a model's pre-training data, to provide insight beyond traditional accuracy measures.
Contribution
The paper proposes two novel alignment-based metrics, Global Alignment and CLIP Similarity, to better understand descriptor effectiveness in vision-language models.
Findings
Metrics reveal how descriptor strategies interact with model properties
Descriptors' effectiveness depends on semantic clarity and pre-training data presence
Evaluation methods extend beyond accuracy to assess descriptor quality
Abstract
Text-based visual descriptors--ranging from simple class names to more descriptive phrases--are widely used in visual concept discovery and image classification with vision-language models (VLMs). Their effectiveness, however, depends on a complex interplay of factors, including semantic clarity, presence in the VLM's pre-training data, and how well the descriptors serve as a meaningful representation space. In this work, we systematically analyze descriptor quality along two key dimensions: (1) representational capacity, and (2) relationship with VLM pre-training data. We evaluate a spectrum of descriptor generation methods, from zero-shot LLM-generated prompts to iteratively refined descriptors. Motivated by ideas from representation alignment and language understanding, we introduce two alignment-based metrics--Global Alignment and CLIP Similarity--that move beyond accuracy. These…
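For readers unfamiliar with the setting the abstract describes, here is a minimal sketch of descriptor-based image classification. It assumes each class is scored by the average cosine similarity between an image embedding and that class's descriptor embeddings, which is one common approach and not necessarily the paper's setup. It does not implement Global Alignment or CLIP Similarity, whose definitions are not given here. The function names and the toy 3-d vectors are hypothetical stand-ins for real VLM embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify(image_emb, descriptor_embs):
    """Pick the class whose descriptors are, on average, most similar to the image.

    descriptor_embs maps a class name to a list of descriptor embedding vectors.
    Returns the winning class name and the per-class scores.
    """
    scores = {
        label: sum(cosine(image_emb, d) for d in embs) / len(embs)
        for label, embs in descriptor_embs.items()
    }
    return max(scores, key=scores.get), scores


# Toy vectors made up for illustration; a real pipeline would obtain these
# from a VLM's text and image encoders.
descriptor_embs = {
    "cat": [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0]],
    "dog": [[0.0, 1.0, 0.0], [0.0, 0.6, 0.8]],
}
image_emb = [0.9, 0.1, 0.0]

label, scores = classify(image_emb, descriptor_embs)
print(label, scores)
```

Accuracy under this scheme only records whether the top-scoring class is correct, which is the kind of measure the abstract says its metrics are meant to move beyond.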
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques
Methods: Contrastive Language-Image Pre-training
