Does VLM Classification Benefit from LLM Description Semantics?
Pingchuan Ma, Lennart Rietdorf, Dmytro Kotovenko, Vincent Tao Hu, Björn Ommer

TL;DR
This paper investigates whether descriptions generated by Large Language Models (LLMs) genuinely improve Vision-Language Model (VLM) classification. It analyzes the semantic contribution of these descriptions and proposes both an evaluation scenario and a training-free selection method to improve accuracy and explainability.
Contribution
The paper introduces an evaluation scenario that distinguishes true semantic benefits from noise effects, and proposes a training-free method for selecting discriminative descriptions for VLM classification.
Findings
Descriptions with genuine semantics improve classification accuracy.
The proposed method outperforms baseline approaches across seven datasets.
Insights into explainability of description-based classification are provided.
Abstract
Accurately describing images with text is a foundation of explainable AI. Vision-Language Models (VLMs) like CLIP have recently addressed this by aligning images and texts in a shared embedding space, expressing semantic similarities between vision and language embeddings. VLM classification can be improved with descriptions generated by Large Language Models (LLMs). However, it is difficult to determine the contribution of actual description semantics, as the performance gain may also stem from a semantic-agnostic ensembling effect, where multiple modified text prompts act as a noisy test-time augmentation for the original one. We propose an alternative evaluation scenario to decide if a performance boost of LLM-generated descriptions is caused by such a noise augmentation effect or rather by genuine description semantics. The proposed scenario avoids noisy test-time augmentation and…
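The description-based classification setting discussed in the abstract can be illustrated with a minimal sketch. It assumes the image and text embeddings have already been computed by a VLM such as CLIP, and it scores each class by the mean cosine similarity between the image embedding and that class's text embeddings. The function names (`cosine`, `class_score`, `classify`), the mean aggregation, and the toy 3-dimensional vectors are all assumptions made for illustration; this is not the paper's proposed evaluation scenario or its selection method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def class_score(image_emb, text_embs):
    """Mean similarity between one image and all text embeddings of a class.

    Averaging over several prompts is the step where an ensembling effect
    can arise, independent of what the prompts actually say.
    """
    return sum(cosine(image_emb, t) for t in text_embs) / len(text_embs)


def classify(image_emb, class_text_embs):
    """Return the best-scoring class name and the per-class scores."""
    scores = {
        name: class_score(image_emb, embs)
        for name, embs in class_text_embs.items()
    }
    return max(scores, key=scores.get), scores


# Toy embeddings (hypothetical values, standing in for real VLM outputs).
image = [1.0, 0.2, 0.0]
classes = {
    "cat": [[0.9, 0.1, 0.0], [1.0, 0.3, 0.1]],  # e.g. class prompt + one description
    "dog": [[0.0, 1.0, 0.0], [0.1, 0.9, 0.2]],
}

pred, scores = classify(image, classes)
print(pred, scores)
```

Because the score is an average over prompts, replacing the descriptions with unrelated or random text can still shift accuracy, which is why a gain measured this way does not by itself show that description semantics are responsible.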
Taxonomy
Topics: Natural Language Processing Techniques · Semantic Web and Ontologies · Advanced Computational Techniques and Applications
Methods: Contrastive Language-Image Pre-training
