ArtVLM: Attribute Recognition Through Vision-Based Prefix Language Modeling
William Yicheng Zhu, Keren Ye, Junjie Ke, Jiahui Yu, Leonidas Guibas, Peyman Milanfar, Feng Yang

TL;DR
ArtVLM introduces a sentence generation-based retrieval method for zero-shot visual attribute recognition. It models object-attribute relations explicitly as a dependency-sensitive language-modeling task and outperforms contrastive retrieval.
Contribution
The paper proposes a new generative retrieval approach that captures object-attribute dependencies for improved zero-shot attribute recognition using vision-language models.
Findings
Generative retrieval outperforms contrastive retrieval on VAW and VGARank datasets.
Explicit modeling of object-attribute relations improves recognition accuracy.
The method distills knowledge from large pretrained vision-language models.
Abstract
Recognizing and disentangling visual attributes from objects is a foundation for many computer vision applications. While large vision-language representations like CLIP have largely resolved the task of zero-shot object recognition, zero-shot visual attribute recognition remains a challenge because CLIP's contrastively learned vision-language representation cannot effectively capture object-attribute dependencies. In this paper, we target this weakness and propose a sentence generation-based retrieval formulation for attribute recognition that is novel in 1) explicitly modeling a to-be-measured and retrieved object-attribute relation as a conditional probability graph, which converts the recognition problem into a dependency-sensitive language-modeling problem, and 2) applying a large pretrained Vision-Language Model (VLM) to this reformulation and naturally distilling its knowledge of…
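The generative retrieval idea described in the abstract can be sketched roughly as follows: each candidate attribute is placed into a sentence, the sentence is scored by its chain-rule log-likelihood under an image-conditioned language model, and the candidates are ranked by that score. This is only an illustrative sketch, not the authors' implementation. The names `sentence_log_prob`, `rank_attributes`, and `toy_next_token_prob`, the "{obj} is {attr}" template, and the bigram table are all assumptions made for this example; a real system would obtain token probabilities from a pretrained VLM decoder conditioned on the image.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical interface: P(next token | image, previous tokens).
# The image conditioning is implicit here; a real model would take image features.
NextTokenProb = Callable[[Sequence[str], str], float]


def sentence_log_prob(tokens: Sequence[str], next_token_prob: NextTokenProb) -> float:
    """Chain-rule log-likelihood: sum over i of log P(token_i | tokens before i)."""
    total = 0.0
    for i, tok in enumerate(tokens):
        total += math.log(next_token_prob(tokens[:i], tok))
    return total


def rank_attributes(
    obj: str,
    attributes: Sequence[str],
    template: str,
    next_token_prob: NextTokenProb,
) -> List[Tuple[str, float]]:
    """Score one sentence per candidate attribute and sort best-first."""
    scored = []
    for attr in attributes:
        tokens = template.format(obj=obj, attr=attr).split()
        scored.append((attr, sentence_log_prob(tokens, next_token_prob)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy stand-in for the model (assumption): a bigram table imagined for an
# image of a red apple. Unlisted transitions get a small default probability.
TOY_BIGRAMS = {
    (None, "apple"): 0.8,
    ("apple", "is"): 0.9,
    ("is", "red"): 0.6,
    ("is", "green"): 0.3,
}


def toy_next_token_prob(prefix: Sequence[str], token: str) -> float:
    prev: Optional[str] = prefix[-1] if prefix else None
    return TOY_BIGRAMS.get((prev, token), 0.01)


ranking = rank_attributes(
    "apple", ["furry", "green", "red"], "{obj} is {attr}", toy_next_token_prob
)
for attr, score in ranking:
    print(f"{attr}: {score:.3f}")
```

Because every token is scored conditioned on the tokens before it, the word order of the template decides which dependency is measured: with the object first, the attribute is scored given the object. A contrastive scorer that compares a single pooled text embedding with the image embedding has no comparable token-by-token factorization.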
Taxonomy
Topics: Natural Language Processing Techniques · Text and Document Classification Technologies · Multimodal Machine Learning Applications
Methods: Contrastive Language-Image Pre-training
