ArtVLM: Attribute Recognition Through Vision-Based Prefix Language Modeling

William Yicheng Zhu; Keren Ye; Junjie Ke; Jiahui Yu; Leonidas Guibas; Peyman Milanfar; Feng Yang

arXiv:2408.04102·cs.CV·October 3, 2024

TL;DR

ArtVLM introduces a novel sentence generation-based retrieval method for zero-shot visual attribute recognition, explicitly modeling object-attribute relations as a dependency-sensitive language modeling task, outperforming contrastive methods.

Contribution

The paper proposes a new generative retrieval approach that captures object-attribute dependencies for improved zero-shot attribute recognition using vision-language models.

Findings

01 Generative retrieval outperforms contrastive retrieval on VAW and VGARank datasets.

02 Explicit modeling of object-attribute relations improves recognition accuracy.

03 Method effectively distills knowledge from large pretrained vision-language models.

Abstract

Recognizing and disentangling visual attributes from objects is foundational to many computer vision applications. While large vision-language representations like CLIP have largely resolved the task of zero-shot object recognition, zero-shot visual attribute recognition remains a challenge because CLIP's contrastively learned vision-language representation cannot effectively capture object-attribute dependencies. In this paper, we target this weakness and propose a sentence generation-based retrieval formulation for attribute recognition that is novel in 1) explicitly modeling a to-be-measured and retrieved object-attribute relation as a conditional probability graph, which converts the recognition problem into a dependency-sensitive language-modeling problem, and 2) applying a large pretrained Vision-Language Model (VLM) on this reformulation and naturally distilling its knowledge of…
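
The generative retrieval idea in the abstract can be illustrated with a minimal sketch: each candidate attribute is turned into a short sentence, and the sentence is scored by the sum of token log-probabilities under an image-conditioned language model. Everything below is an assumption made for illustration, not the authors' implementation: the names `sentence_log_prob`, `rank_attributes`, and `toy_log_prob`, the "{attr} {obj}" template, and the probability table are all hypothetical, and a lookup table stands in for a real VLM.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical interface: log p(token | image, prefix) from an
# image-conditioned language model. A real VLM would replace this.
LogProbFn = Callable[[str, Sequence[str], str], float]


def sentence_log_prob(image_id: str, tokens: Sequence[str],
                      log_prob_fn: LogProbFn) -> float:
    """Chain-rule score: sum of log p(token_i | image, tokens[:i])."""
    total = 0.0
    for i, token in enumerate(tokens):
        total += log_prob_fn(image_id, tokens[:i], token)
    return total


def rank_attributes(image_id: str, obj: str, attributes: Sequence[str],
                    log_prob_fn: LogProbFn,
                    template: str = "{attr} {obj}") -> List[Tuple[str, float]]:
    """Score one sentence per candidate attribute and sort best-first."""
    scored = []
    for attr in attributes:
        # The template (an assumption) fixes the order in which the
        # attribute and object are generated, and hence the dependency.
        tokens = template.format(attr=attr, obj=obj).split()
        scored.append((attr, sentence_log_prob(image_id, tokens, log_prob_fn)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy conditional probabilities keyed by (image, prefix, next token).
# These numbers are invented purely to make the sketch runnable.
TOY_TABLE: Dict[Tuple[str, Tuple[str, ...], str], float] = {
    ("img0", (), "red"): 0.6,
    ("img0", (), "blue"): 0.3,
    ("img0", ("red",), "car"): 0.9,
    ("img0", ("blue",), "car"): 0.8,
}


def toy_log_prob(image_id: str, prefix: Sequence[str], token: str) -> float:
    """Table lookup with a small floor for unseen events."""
    return math.log(TOY_TABLE.get((image_id, tuple(prefix), token), 1e-6))


ranking = rank_attributes("img0", "car", ["blue", "red"], toy_log_prob)
```

In this toy setup, `ranking` lists the candidate attributes best-first by sentence log-probability. Changing the template to "{obj} is {attr}" would condition the attribute on the object instead, which is the kind of dependency ordering that a contrastive similarity score does not express.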

Code & Models

Repositories

google-research/google-research
tf · Official

Taxonomy

Topics: Natural Language Processing Techniques · Text and Document Classification Technologies · Multimodal Machine Learning Applications

Methods: Contrastive Language-Image Pre-training