Exploring the Distinctiveness and Fidelity of the Descriptions Generated by Large Vision-Language Models
Yuhang Huang, Zihan Wu, Chongyang Gao, Jiawei Peng, Xu Yang

TL;DR
This paper evaluates the ability of large vision-language models to generate precise, fine-grained descriptions, focusing on their distinctiveness and fidelity, and introduces the TRAC framework for analysis.
Contribution
It introduces the TRAC framework for analyzing fine-grained visual descriptions and compares models like Open-Flamingo, IDEFICS, and MiniGPT-4 in this context.
Findings
MiniGPT-4 outperforms others in fine-grained description quality.
LVLMs vary significantly in their ability to distinguish similar objects.
The TRAC framework provides new insights into model description capabilities.
Abstract
Large Vision-Language Models (LVLMs) are gaining traction for their remarkable ability to process and integrate visual and textual data. Despite their popularity, the capacity of LVLMs to generate precise, fine-grained textual descriptions has not been fully explored. This study addresses this gap by focusing on distinctiveness and fidelity, assessing how well models like Open-Flamingo, IDEFICS, and MiniGPT-4 can distinguish between similar objects and accurately describe visual features. We propose the Textual Retrieval-Augmented Classification (TRAC) framework, which leverages the models' generative capabilities to support a deeper analysis of fine-grained visual description generation. This research provides valuable insights into the generation quality of LVLMs, enhancing the understanding of multimodal language models. Notably, MiniGPT-4 stands out for its…
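The abstract does not spell out how TRAC works, so the following is only a rough sketch of one way a retrieval-based classifier over text descriptions could be used to probe distinctiveness: if a generated description is distinctive, the labeled descriptions most similar to it should belong to the same class. Everything here is an assumption made for illustration, including the bag-of-words similarity (a stand-in for a real text encoder), the function names, and the toy support set; none of it is taken from the paper.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased whitespace token counts (hypothetical stand-in for a text encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify_by_retrieval(query_description, support, k=1):
    """Return the majority label among the k support descriptions most similar to the query.

    support is a list of (description, label) pairs; both the structure and
    the voting rule are illustrative assumptions.
    """
    query_vec = bag_of_words(query_description)
    ranked = sorted(
        support,
        key=lambda item: cosine(query_vec, bag_of_words(item[0])),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy support set of labeled descriptions (invented for this example).
support = [
    ("a small bird with a bright red crest and black face", "cardinal"),
    ("a small bird with blue wings and a white belly", "blue jay"),
]

print(classify_by_retrieval("a bird with a red crest", support))
```

Under this reading, classification accuracy over many generated descriptions would act as a proxy for how well those descriptions separate similar classes; a real setup would presumably replace the token-count similarity with a learned text embedding.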
Taxonomy
Topics: Semantic Web and Ontologies · Geographic Information Systems Studies
