Tree of Attributes Prompt Learning for Vision-Language Models
Tong Ding, Wanhua Li, Zhongqi Miao, Hanspeter Pfister

TL;DR
This paper introduces Tree of Attributes Prompt learning (TAP), a method that leverages structured attribute hierarchies generated by LLMs to improve vision-language model adaptation for various classification tasks.
Contribution
TAP distills structured knowledge graphs from LLMs and incorporates explicit visual attribute learning, enhancing zero-shot and few-shot classification performance.
Findings
Outperforms state-of-the-art methods on multiple datasets.
Improves zero-shot base-to-novel generalization.
Enhances cross-dataset transfer and few-shot classification.
Abstract
Prompt learning has proven effective in adapting vision-language models to downstream tasks. However, existing methods usually combine learnable prompt tokens with the category name alone to obtain textual features, which fails to fully leverage the rich context implied by the category name. To address this issue, we propose Tree of Attributes Prompt learning (TAP), which first instructs LLMs to generate a tree of attributes with a "concept - attribute - description" structure for each category, and then learns the hierarchy with vision and text prompt tokens. Unlike existing methods that merely augment category names with a set of unstructured descriptions, our approach essentially distills structured knowledge graphs associated with class names from LLMs. Furthermore, our approach introduces text and vision prompts designed to explicitly learn the corresponding visual…
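To make the "concept - attribute - description" structure from the abstract concrete, here is a minimal Python sketch of one such tree for a single category. It is an illustration only, not the authors' implementation: the class names (Attribute, AttributeTree), the method prompts_by_attribute, the prompt wording, and the example attributes are all hypothetical, and the descriptions are hand-written placeholders rather than real LLM output.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Attribute:
    """One attribute of a concept, with its free-text descriptions (the leaves)."""
    name: str
    descriptions: List[str] = field(default_factory=list)


@dataclass
class AttributeTree:
    """Root = concept (class name); children = attributes; leaves = descriptions."""
    concept: str
    attributes: List[Attribute] = field(default_factory=list)

    def prompts_by_attribute(self) -> Dict[str, List[str]]:
        """Build one text prompt per description, grouped under its parent attribute.

        The prompt wording here is an assumption made for illustration.
        """
        return {
            attr.name: [f"{self.concept}, {attr.name}: {d}" for d in attr.descriptions]
            for attr in self.attributes
        }


# Hand-written placeholder content (hypothetical), standing in for LLM output.
tree = AttributeTree(
    concept="golden retriever",
    attributes=[
        Attribute("fur", ["long and wavy", "golden in color"]),
        Attribute("ears", ["floppy and hanging"]),
    ],
)

prompts = tree.prompts_by_attribute()
for attribute_name, texts in prompts.items():
    print(attribute_name, texts)
```

Grouping the descriptions under their parent attribute, rather than flattening them into a single list, is what separates this layout from the unstructured description sets that the abstract contrasts TAP with.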
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Topic Modeling
Methods: Sparse Evolutionary Training
