Exploring the Distinctiveness and Fidelity of the Descriptions Generated by Large Vision-Language Models

Yuhang Huang; Zihan Wu; Chongyang Gao; Jiawei Peng; Xu Yang

arXiv:2404.17534 · cs.CV · April 29, 2024 · 1 citation

TL;DR

This paper evaluates the ability of large vision-language models to generate precise, fine-grained descriptions, focusing on their distinctiveness and fidelity, and introduces the TRAC framework for analysis.

Contribution

It introduces the TRAC framework for analyzing fine-grained visual descriptions and compares models like Open-Flamingo, IDEFICS, and MiniGPT-4 in this context.

Findings

01

MiniGPT-4 outperforms others in fine-grained description quality.

02

LVLMs vary significantly in their ability to distinguish similar objects.

03

The TRAC framework provides new insights into model description capabilities.

Abstract

Large Vision-Language Models (LVLMs) are gaining traction for their remarkable ability to process and integrate visual and textual data. Despite their popularity, the capacity of LVLMs to generate precise, fine-grained textual descriptions has not been fully explored. This study addresses this gap by focusing on distinctiveness and fidelity, assessing how well models such as Open-Flamingo, IDEFICS, and MiniGPT-4 distinguish between similar objects and accurately describe visual features. We propose the Textual Retrieval-Augmented Classification (TRAC) framework, which leverages the models' generative capabilities to enable a deeper analysis of fine-grained visual description generation. This research provides valuable insights into the generation quality of LVLMs, enhancing the understanding of multimodal language models. Notably, MiniGPT-4 stands out for its…
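
The abstract names TRAC but does not describe its pipeline, so the following is only an illustrative sketch of one plausible reading of "textual retrieval-augmented classification": a generated description is classified by retrieving the most similar labeled descriptions and taking a majority vote. The function names, the bag-of-words cosine similarity, the value of `k`, and the bird descriptions are all assumptions made for illustration, not details from the paper.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify_by_retrieval(query_description, labeled_descriptions, k=3):
    """Hypothetical helper: label a description by majority vote over
    the k most similar (description, label) pairs in the support set."""
    query = Counter(tokenize(query_description))
    scored = sorted(
        ((cosine(query, Counter(tokenize(text))), label)
         for text, label in labeled_descriptions),
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Made-up support set standing in for model-generated descriptions.
support = [
    ("small bird with a bright red crest and black face mask", "cardinal"),
    ("red bird with a crest and thick orange beak", "cardinal"),
    ("blue bird with white chest and black necklace marking", "blue jay"),
    ("blue crested bird with barred blue wings", "blue jay"),
]

print(classify_by_retrieval("a red bird with a pointed crest and orange beak", support))
```

Under this reading, classification accuracy acts as a proxy for distinctiveness: descriptions that omit fine-grained attributes retrieve neighbors from the wrong class more often. A real setup would likely replace the bag-of-words similarity with a learned text encoder.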

Taxonomy

Topics: Semantic Web and Ontologies · Geographic Information Systems Studies