Unveiling the Tapestry of Consistency in Large Vision-Language Models

Yuan Zhang; Fei Xiao; Tao Huang; Chun-Kai Fan; Hongyuan Dong; Jiawen Li; Jiacong Wang; Kuan Cheng; Shanghang Zhang; Haoyuan Guo

arXiv:2405.14156 · cs.CV · October 8, 2024 · 1 citation

TL;DR

This paper introduces ConBench, a benchmark for evaluating consistency in large vision-language models across different solution spaces, revealing key insights and proposing diagnostic refinement to improve model reliability.

Contribution

It presents ConBench, the first multi-modal benchmark for analyzing LVLM consistency, uncovers relationships between solution space size, accuracy, and model bias, and proposes refinement methods.

Findings

01

In the discriminative realm, the larger the solution space of a prompt, the lower the accuracy of the answers.

02

Discriminative question accuracy correlates with caption consistency.

03

Compared with open-source models, closed-source models show a pronounced advantage in consistency.

Abstract

Large vision-language models (LVLMs) have recently achieved rapid progress, exhibiting great perception and reasoning abilities concerning visual information. However, when faced with prompts in different sizes of solution spaces, LVLMs fail to always give consistent answers regarding the same knowledge point. This inconsistency of answers between different solution spaces is prevalent in LVLMs and erodes trust. To this end, we provide a multi-modal benchmark ConBench, to intuitively analyze how LVLMs perform when the solution space of a prompt revolves around a knowledge point. Based on the ConBench tool, we are the first to reveal the tapestry and get the following findings: (1) In the discriminate realm, the larger the solution space of the prompt, the lower the accuracy of the answers. (2) Establish the relationship between the discriminative and generative realms: the accuracy of…
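As a rough illustration of the idea in the abstract, the sketch below groups answers by knowledge point and compares accuracy per prompt type with a simple consistency rate. It is not the ConBench implementation: the record layout, the prompt-type names (`true_false`, `multiple_choice`, `limited_vqa`), the sample values, and the rule that a knowledge point counts as consistent only when every prompt type is answered correctly are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical records: (knowledge_point_id, prompt_type, is_correct).
# Prompt types are listed from smaller to larger solution space.
# All values are illustrative and are not taken from ConBench.
records = [
    ("kp1", "true_false", True),
    ("kp1", "multiple_choice", True),
    ("kp1", "limited_vqa", False),
    ("kp2", "true_false", True),
    ("kp2", "multiple_choice", True),
    ("kp2", "limited_vqa", True),
    ("kp3", "true_false", True),
    ("kp3", "multiple_choice", False),
    ("kp3", "limited_vqa", False),
]


def accuracy_by_prompt_type(rows):
    """Return the fraction of correct answers for each prompt type."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for _, prompt_type, is_correct in rows:
        totals[prompt_type] += 1
        hits[prompt_type] += int(is_correct)
    return {ptype: hits[ptype] / totals[ptype] for ptype in totals}


def consistency_rate(rows):
    """Return the fraction of knowledge points answered correctly
    under every prompt type (an assumed definition of consistency)."""
    by_point = defaultdict(list)
    for point_id, _, is_correct in rows:
        by_point[point_id].append(is_correct)
    consistent = sum(1 for answers in by_point.values() if all(answers))
    return consistent / len(by_point)


accuracy = accuracy_by_prompt_type(records)
consistency = consistency_rate(records)
```

With these made-up records, per-type accuracy falls as the solution space grows, and only one of the three knowledge points is answered correctly under all prompt types, so the consistency rate is well below the best single-type accuracy.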

Code & Models

Repositories

foundation-multimodal-models/conbench
PyTorch · Official

Videos

Unveiling the Tapestry of Consistency in Large Vision-Language Models · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications