Verbalized Representation Learning for Interpretable Few-Shot Generalization
Cheng-Fu Yang, Da Yin, Wenbo Hu, Heng Ji, Nanyun Peng, Bolei Zhou, Kai-Wei Chang

TL;DR
This paper introduces Verbalized Representation Learning (VRL), a method that automatically extracts human-interpretable features from few-shot data using vision-language models, significantly improving low-data object recognition.
Contribution
VRL is a novel approach that captures verbalized features for interpretability and enhances few-shot generalization, outperforming prior methods with less data.
Findings
24% absolute improvement over state-of-the-art
Uses 95% less data than previous methods
Features outperform human-labeled attributes by 20%
Abstract
Humans recognize objects after observing only a few examples, a remarkable capability enabled by their inherent language understanding of the real-world environment. Developing verbalized and interpretable representations can significantly improve model generalization in low-data settings. In this work, we propose Verbalized Representation Learning (VRL), a novel approach for automatically extracting human-interpretable features for object recognition using few-shot data. Our method uniquely captures inter-class differences and intra-class commonalities in the form of natural language by employing a Vision-Language Model (VLM) to identify key discriminative features between different classes and shared characteristics within the same class. These verbalized features are then mapped to numeric vectors through the VLM. The resulting feature vectors can be further utilized to train and…
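The pipeline the abstract describes (verbalized features, mapped to numeric vectors, then used by a downstream model) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration, not the authors' implementation: `toy_scorer` is a hypothetical stand-in for the VLM call that scores an image against a verbalized feature, the "images" are toy sets of attribute strings, the feature texts are invented, and the nearest-centroid classifier is a simple substitute for whatever downstream model the paper actually trains.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical interface: a scorer takes an image and a verbalized feature
# (a natural-language description) and returns a number. In the paper this
# role is played by a VLM; here it is only a stub.
Scorer = Callable[[object, str], float]


def featurize(image: object, features: Sequence[str], scorer: Scorer) -> List[float]:
    """Map one image to a numeric vector, one entry per verbalized feature."""
    return [scorer(image, feature) for feature in features]


def fit_centroids(vectors: Sequence[Sequence[float]],
                  labels: Sequence[str]) -> Dict[str, List[float]]:
    """Average the feature vectors of each class (assumed classifier, not the paper's)."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = {}
    for vec, label in zip(vectors, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(vec: Sequence[float], centroids: Dict[str, List[float]]) -> str:
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(vec, centroids[label]))


def toy_scorer(image: object, feature: str) -> float:
    """Stub for the VLM: 1.0 if the toy image 'contains' the feature, else 0.0."""
    return 1.0 if feature in image else 0.0


# Invented verbalized features, standing in for VLM-generated descriptions.
FEATURES = ["has a red crown", "has a long curved beak", "has white wing bars"]

# Toy few-shot training set: each "image" is a set of attribute strings.
train_images = [
    {"has a red crown", "has white wing bars"},
    {"has a red crown"},
    {"has a long curved beak"},
    {"has a long curved beak", "has white wing bars"},
]
train_labels = ["class_a", "class_a", "class_b", "class_b"]

train_vectors = [featurize(img, FEATURES, toy_scorer) for img in train_images]
centroids = fit_centroids(train_vectors, train_labels)

query = featurize({"has a red crown"}, FEATURES, toy_scorer)
prediction = predict(query, centroids)
print(prediction)
```

Because each vector dimension corresponds to one natural-language feature, a prediction can be traced back to the descriptions that drove it, which is the interpretability property the abstract emphasizes. In a real setting, `toy_scorer` would be replaced by a call to an actual VLM.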
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Speech Recognition and Synthesis
