Text Descriptions are Compressive and Invariant Representations for Visual Learning
Zhili Feng, Anna Bair, J. Zico Kolter

TL;DR
This paper introduces SLR-AVD, a novel method that generates multiple visual descriptions of each class with a language model, then applies sparse logistic regression to obtain invariant, compressed visual features, improving robustness in few-shot learning.
Contribution
The paper proposes SLR-AVD, a new approach combining language-generated visual descriptions with sparse regression for invariant feature extraction in visual learning.
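The pipeline described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: random vectors stand in for the VLM embeddings of images and of LLM-generated descriptions, the task is binary rather than multi-class, and the L1-penalized logistic regression is fit with plain proximal gradient descent. All function names and hyperparameters (`description_features`, `fit_sparse_logreg`, `l1`, `lr`, `steps`) are hypothetical.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def description_features(image_emb, description_embs):
    """One feature per text description: image-to-description similarity."""
    return [cosine(image_emb, d) for d in description_embs]

def soft_threshold(value, amount):
    """Proximal step for the L1 penalty; small weights become exactly zero."""
    return math.copysign(max(abs(value) - amount, 0.0), value)

def fit_sparse_logreg(X, y, l1=0.05, lr=0.5, steps=500):
    """Binary L1-regularized logistic regression via proximal gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted prob minus label
            grad_b += err / n
            for j in range(d):
                grad_w[j] += err * xi[j] / n
        b -= lr * grad_b  # the intercept is not penalized
        w = [soft_threshold(wj - lr * gj, lr * l1) for wj, gj in zip(w, grad_w)]
    return w, b

def predict(w, b, x):
    return 1 if b + sum(wj * xj for wj, xj in zip(w, x)) > 0 else 0

# Synthetic stand-ins (assumption): 8 "description" embeddings in 16 dimensions.
rng = random.Random(0)
DIM, N_DESC, N_PER_CLASS = 16, 8, 20
description_embs = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(N_DESC)]

def fake_image(anchor):
    """A fake image embedding: a description embedding plus Gaussian noise."""
    return [a + 0.3 * rng.gauss(0, 1) for a in anchor]

# Class 1 images resemble description 0; class 0 images resemble description 1.
X, y = [], []
for label, anchor_idx in ((1, 0), (0, 1)):
    for _ in range(N_PER_CLASS):
        image = fake_image(description_embs[anchor_idx])
        X.append(description_features(image, description_embs))
        y.append(label)

w, b = fit_sparse_logreg(X, y)
acc = sum(predict(w, b, xi) == yi for xi, yi in zip(X, y)) / len(y)
selected = [j for j, wj in enumerate(w) if wj != 0.0]
print("training accuracy:", acc)
print("descriptions kept by the L1 penalty:", selected)
```

The design point is that each feature is tied to one human-readable description, so the nonzero weights in `w` name the descriptions the classifier relies on. A real setup would replace the random vectors with embeddings from a model such as CLIP.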
Findings
Descriptive features are more invariant to domain shifts.
SLR-AVD outperforms state-of-the-art finetuning methods in the few-shot setting.
Invariant features improve in- and out-of-distribution performance.
Abstract
Modern image classification is based upon directly predicting classes via large discriminative networks, which do not directly contain information about the intuitive visual features that may constitute a classification decision. Recently, work in vision-language models (VLM) such as CLIP has provided ways to specify natural language descriptions of image classes, but typically focuses on providing single descriptions for each class. In this work, we demonstrate that an alternative approach, in line with humans' understanding of multiple visual features per class, can also provide compelling performance in the robust few-shot learning setting. In particular, we introduce a novel method, SLR-AVD (Sparse Logistic Regression using Augmented Visual Descriptors). This method first automatically generates multiple visual descriptions of each class via a large language model (LLM),…
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
Methods: Focus · Logistic Regression · Contrastive Language-Image Pre-training
