Unveiling the Tapestry of Consistency in Large Vision-Language Models
Yuan Zhang, Fei Xiao, Tao Huang, Chun-Kai Fan, Hongyuan Dong, Jiawen Li, Jiacong Wang, Kuan Cheng, Shanghang Zhang, Haoyuan Guo

TL;DR
This paper introduces ConBench, a benchmark for evaluating consistency in large vision-language models across different solution spaces, revealing key insights and proposing diagnostic refinement to improve model reliability.
Contribution
It presents ConBench, the first multi-modal benchmark for analyzing LVLM consistency, uncovers relationships among solution space size, accuracy, and model bias, and proposes refinement methods.
Findings
In the discriminative realm, the larger the solution space of a prompt, the lower the accuracy of the answers.
Accuracy on discriminative questions correlates with how consistent those answers are with the model's caption.
Closed-source models show a pronounced advantage over open-source models in consistency.
Abstract
Large vision-language models (LVLMs) have recently achieved rapid progress, exhibiting great perception and reasoning abilities concerning visual information. However, when faced with prompts in different sizes of solution spaces, LVLMs fail to always give consistent answers regarding the same knowledge point. This inconsistency of answers between different solution spaces is prevalent in LVLMs and erodes trust. To this end, we provide a multi-modal benchmark ConBench, to intuitively analyze how LVLMs perform when the solution space of a prompt revolves around a knowledge point. Based on the ConBench tool, we are the first to reveal the tapestry and get the following findings: (1) In the discriminate realm, the larger the solution space of the prompt, the lower the accuracy of the answers. (2) Establish the relationship between the discriminative and generative realms: the accuracy of…
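The idea of probing one knowledge point through prompts with different solution-space sizes can be illustrated with a minimal sketch. It is not the ConBench implementation: the question-type names, the record layout, the `accuracy_by_type` and `consistency_rate` helpers, and the toy data are all assumptions made for illustration, and "consistency" here simply means that a case is answered correctly under every question type.

```python
from collections import defaultdict

# Hypothetical question types, ordered by increasing solution-space size.
QUESTION_TYPES = ("true_false", "multiple_choice", "open_vqa")


def accuracy_by_type(records):
    """Return the fraction of correct answers for each question type."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in records:
        totals[r["type"]] += 1
        hits[r["type"]] += int(r["correct"])
    return {t: hits[t] / totals[t] for t in totals}


def consistency_rate(records):
    """Return the fraction of cases answered correctly under every question type.

    Cases that lack an answer for any question type are ignored.
    """
    by_case = defaultdict(dict)
    for r in records:
        by_case[r["case"]][r["type"]] = r["correct"]
    complete = [c for c in by_case.values() if all(t in c for t in QUESTION_TYPES)]
    if not complete:
        return 0.0
    consistent = sum(all(c[t] for t in QUESTION_TYPES) for c in complete)
    return consistent / len(complete)


# Made-up toy records: two cases (knowledge points), three prompts each.
records = [
    {"case": "A", "type": "true_false", "correct": True},
    {"case": "A", "type": "multiple_choice", "correct": True},
    {"case": "A", "type": "open_vqa", "correct": True},
    {"case": "B", "type": "true_false", "correct": True},
    {"case": "B", "type": "multiple_choice", "correct": True},
    {"case": "B", "type": "open_vqa", "correct": False},
]

acc = accuracy_by_type(records)
rate = consistency_rate(records)
```

In this toy data, the model answers case B correctly in the two smaller solution spaces but fails the open-ended prompt, so per-type accuracy alone would hide the fact that only one of the two cases is handled consistently.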
Taxonomy
Topics: Multimodal Machine Learning Applications
