Text Descriptions are Compressive and Invariant Representations for Visual Learning

Zhili Feng; Anna Bair; J. Zico Kolter

arXiv:2307.04317·cs.CV·October 31, 2023

Open Access

TL;DR

This paper introduces SLR-AVD, a novel method that uses multiple visual descriptions generated by language models and sparse logistic regression to create invariant, compressed visual features, improving robustness in few-shot learning.

Contribution

The paper proposes SLR-AVD, a new approach combining language-generated visual descriptions with sparse regression for invariant feature extraction in visual learning.
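
The idea of pairing description-based features with sparse regression can be sketched on toy data. The example below is a minimal illustration under stated assumptions, not the authors' implementation: the embeddings are hand-made vectors rather than CLIP outputs, the task is binary rather than multi-class, the helper names (`description_features`, `fit_sparse_logreg`, `predict`) are hypothetical, and proximal gradient descent is simply one convenient way to fit an L1-penalized logistic regression.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def description_features(image_emb, description_embs):
    """One feature per description: similarity of the image to that description."""
    return [cosine(image_emb, d) for d in description_embs]

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def soft_threshold(x, t):
    # Proximal operator of t * |x|; sets small values exactly to zero.
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0

def fit_sparse_logreg(X, y, l1=0.05, lr=0.5, steps=500):
    """Binary L1-regularized logistic regression via proximal gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j] / n
            grad_b += err / n
        # Gradient step on the logistic loss, then shrink weights toward zero.
        w = [soft_threshold(wj - lr * gj, lr * l1) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b  # the bias is not penalized
    return w, b

def predict(w, b, x):
    """Return 1 if the linear score is positive, else 0."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0

# Toy stand-ins for text embeddings of three descriptions (not real CLIP outputs).
description_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Toy image embeddings: the two classes differ only along the first axis,
# so only the first description carries class signal by construction.
image_embs = [[0.6, 0.5, 0.2], [0.4, 0.1, 0.7], [-0.6, 0.5, 0.2], [-0.4, 0.1, 0.7]]
labels = [1, 1, 0, 0]

X = [description_features(e, description_embs) for e in image_embs]
weights, bias = fit_sparse_logreg(X, labels)
predictions = [predict(weights, bias, x) for x in X]
print("weights:", weights)
print("predictions:", predictions)
```

On this constructed data the L1 penalty keeps the weights of the two uninformative descriptions at exactly zero, which is the "compressive" behavior the summary refers to. In practice a library solver (for example scikit-learn's `LogisticRegression` with `penalty="l1"`) would likely replace the hand-written loop.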

Findings

1. Descriptive features are more invariant to domain shifts.
2. SLR-AVD outperforms state-of-the-art finetuning methods.
3. Invariant features improve in- and out-of-distribution performance.

Abstract

Modern image classification is based upon directly predicting classes via large discriminative networks, which do not directly contain information about the intuitive visual features that may constitute a classification decision. Recently, work in vision-language models (VLM) such as CLIP has provided ways to specify natural language descriptions of image classes, but typically focuses on providing single descriptions for each class. In this work, we demonstrate that an alternative approach, in line with humans' understanding of multiple visual features per class, can also provide compelling performance in the robust few-shot learning setting. In particular, we introduce a novel method, SLR-AVD (Sparse Logistic Regression using Augmented Visual Descriptors). This method first automatically generates multiple visual descriptions of each class via a large language model (LLM),…

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques

Methods: Focus · Logistic Regression · Contrastive Language-Image Pre-training