Evaluating Attribute Comprehension in Large Vision-Language Models
Haiwen Zhang, Zixi Yang, Yuanzhi Liu, Xinran Wang, Zheqi He, Kongming Liang, and Zhanyu Ma

TL;DR
This paper assesses how well large vision-language models understand object attributes, focusing on attribute recognition and attribute hierarchy understanding, and reveals both strengths and limitations in their fine-grained visual comprehension.
Contribution
It introduces a comprehensive evaluation framework for attribute comprehension in vision-language models, highlighting the impact of fine-tuning data and interaction types on their understanding.
Findings
Models excel at attribute recognition but have limited hierarchical understanding.
Image-text matching outperforms visual question answering in attribute comprehension.
Caption attribute information significantly influences fine-tuning effectiveness.
Abstract
Large vision-language models have made promising progress on many downstream tasks. However, they still face challenges in fine-grained visual understanding tasks such as object attribute comprehension. Moreover, although efforts to evaluate large vision-language models are growing, in-depth studies of attribute comprehension and of the visual language fine-tuning process are still lacking. In this paper, we propose to evaluate the attribute comprehension ability of large vision-language models from two perspectives: attribute recognition and attribute hierarchy understanding. We evaluate three vision-language interactions: visual question answering, image-text matching, and image-text cosine similarity. Furthermore, we explore the factors affecting attribute comprehension during fine-tuning. Through a series of quantitative and qualitative experiments,…
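Of the three interactions named in the abstract, image-text cosine similarity is the easiest to illustrate in isolation. The sketch below is not the authors' evaluation code. It assumes that an image and a set of attribute prompts have already been embedded into a shared vector space by some vision-language model, and the embedding values, attribute names, and function names (`cosine_similarity`, `rank_attributes`) are all made up for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def rank_attributes(image_embedding, text_embeddings):
    """Return (attribute, score) pairs sorted by descending similarity."""
    scores = {
        attribute: cosine_similarity(image_embedding, embedding)
        for attribute, embedding in text_embeddings.items()
    }
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


# Toy embeddings (assumed, not produced by any real model).
image_embedding = [0.9, 0.1, 0.2]
text_embeddings = {
    "red": [1.0, 0.0, 0.1],
    "blue": [0.0, 1.0, 0.1],
    "striped": [0.1, 0.2, 1.0],
}

ranking = rank_attributes(image_embedding, text_embeddings)
predicted_attribute = ranking[0][0]  # highest-scoring attribute prompt
```

Under this setup, attribute recognition reduces to picking the attribute whose prompt embedding scores highest against the image embedding. How the paper constructs prompts and which encoders it uses is not stated in the text above.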
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Graph Neural Networks
