Beyond Accuracy: Metrics that Uncover What Makes a 'Good' Visual Descriptor

Ethan Lin; Linxi Zhao; Atharva Sehgal; Jennifer J. Sun

arXiv:2507.03542 · cs.CV · July 10, 2025

TL;DR

This paper introduces metrics for evaluating visual descriptors along two dimensions, representational capacity and relationship with the model's pre-training data, giving insight into descriptor quality that accuracy alone does not provide.

Contribution

The paper proposes two alignment-based metrics, Global Alignment and CLIP Similarity, for analyzing descriptor effectiveness in vision-language models (VLMs).

Findings

01. Metrics reveal how descriptor strategies interact with model properties

02. Descriptors' effectiveness depends on semantic clarity and pre-training data presence

03. Evaluation methods extend beyond accuracy to assess descriptor quality

Abstract

Text-based visual descriptors--ranging from simple class names to more descriptive phrases--are widely used in visual concept discovery and image classification with vision-language models (VLMs). Their effectiveness, however, depends on a complex interplay of factors, including semantic clarity, presence in the VLM's pre-training data, and how well the descriptors serve as a meaningful representation space. In this work, we systematically analyze descriptor quality along two key dimensions: (1) representational capacity, and (2) relationship with VLM pre-training data. We evaluate a spectrum of descriptor generation methods, from zero-shot LLM-generated prompts to iteratively refined descriptors. Motivated by ideas from representation alignment and language understanding, we introduce two alignment-based metrics--Global Alignment and CLIP Similarity--that move beyond accuracy. These…
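For readers unfamiliar with the setting, the following is a minimal sketch of how text descriptors are commonly used for image classification with a CLIP-style VLM: each class is represented by the embeddings of its descriptors, and an image is assigned to the class whose descriptors are, on average, most similar to it. This is not the paper's Global Alignment or CLIP Similarity metric, whose definitions are not given in this listing. The function names (`descriptor_scores`, `classify`), the mean-pooling rule, and the toy 3-dimensional embeddings are illustrative assumptions; in practice the embeddings would come from a VLM's image and text encoders.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def descriptor_scores(image_emb, class_descriptor_embs):
    """Mean cosine similarity between one image and each class's descriptors.

    image_emb: array of shape (d,)
    class_descriptor_embs: dict mapping class name -> array of shape (k, d)
    Mean pooling over descriptors is an assumption made for this sketch.
    """
    image_emb = l2_normalize(image_emb)
    scores = {}
    for name, embs in class_descriptor_embs.items():
        sims = l2_normalize(embs) @ image_emb  # shape (k,)
        scores[name] = float(sims.mean())
    return scores


def classify(image_emb, class_descriptor_embs):
    """Return the highest-scoring class and the full score dictionary."""
    scores = descriptor_scores(image_emb, class_descriptor_embs)
    return max(scores, key=scores.get), scores


# Toy embeddings (hypothetical values standing in for encoder outputs).
toy_descriptors = {
    "cat": np.array([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]),
    "dog": np.array([[0.0, 1.0, 0.0], [0.1, 0.9, 0.0]]),
}
toy_image = np.array([0.8, 0.2, 0.0])

predicted, scores = classify(toy_image, toy_descriptors)
print(predicted, scores)
```

Accuracy computed from predictions like these says only whether the top-scoring class was correct; it does not say why a descriptor set works, which is the gap the paper's metrics are meant to address.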

Code & Models

Repositories

ethan-y-lin/beyond_accuracy (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques

Methods: Contrastive Language-Image Pre-training