Towards Annotation-Free Validation of MLLMs: A Vision-Language Logical Consistency Metric
Ying Gu, Mei Chee Leong, Hui Li Tan, Shangbo Mao, Liyuan Li, Nancy Chen

TL;DR
This paper introduces VL-LCM, a new framework for evaluating the logical consistency of vision-language models without relying on ground-truth annotations, addressing limitations of accuracy-based metrics.
Contribution
It proposes a novel, ground-truth-free logical consistency metric for MLLMs and validates its effectiveness across multiple benchmarks and challenges.
Findings
VL-LCM reveals significant logical consistency gaps in recent MLLMs despite accuracy improvements.
Extensive experiments confirm VL-LCM's validity and correlation with traditional metrics.
VL-LCM enables model validation and selection without ground-truth annotations.
Abstract
Dominant accuracy-based evaluation might reward unwarranted guessing by Large Language Models, and it might not be applicable to model validation on novel tasks that lack ground-truth (gt) annotation. Based on basic logical principles, we propose a novel framework to evaluate the vision-language logical consistency of MLLMs on both sufficient and necessary cause-effect relations. We define the Vision-Language Logical Consistency Metric (VL-LCM) on traditional MC-VQA tests and recent NaturalBench tests, without the need for gt annotation. Through systematic experiments on the representative VL benchmark MMMU and recent VL challenges like NaturalBench, we evaluated 11 recent open-source MLLMs from 4 frontier families. Our findings reveal that, despite the significant progress of recent MLLMs on accuracy, logical consistency lags behind significantly. Extensive evaluations on the correlations of VL-LCM with…
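The abstract does not spell out how VL-LCM is computed, so the following is only a toy sketch of the general idea of an annotation-free consistency check, not the paper's definition. It assumes a hypothetical `ask` callable that wraps a model (with the image already bound to it) and returns a text answer. The sketch takes the model's own multiple-choice pick and then probes each option with a yes/no question: the picked option should get "yes" and every other option should get "no". The prompt wording, the function names, and the scoring rule are all illustrative assumptions.

```python
from typing import Callable, Sequence

# Hypothetical interface (an assumption, not from the paper):
# ask(prompt) returns the model's answer as a string.
Ask = Callable[[str], str]


def consistency_score(ask: Ask, question: str, options: Sequence[str]) -> float:
    """Toy annotation-free consistency check for one multiple-choice item.

    Returns the fraction of yes/no probes whose answers agree with the
    model's own multiple-choice pick. No ground-truth label is used.
    """
    choice = ask(f"{question} Options: {', '.join(options)}. Answer with one option.")
    agree = 0
    for opt in options:
        reply = ask(f"{question} Is the answer '{opt}'? Answer yes or no.")
        # The picked option should be confirmed; all others should be rejected.
        expected = "yes" if opt == choice else "no"
        agree += reply.strip().lower() == expected
    return agree / len(options)


def make_stub(pick: str, yes_to: set) -> Ask:
    """Build a fake model for demonstration: it always picks `pick` and
    says 'yes' to any yes/no probe about an option in `yes_to`."""
    def ask(prompt: str) -> str:
        if "Answer yes or no" in prompt:
            return "yes" if any(f"'{o}'" in prompt for o in yes_to) else "no"
        return pick
    return ask


options = ["cat", "dog", "bird"]
consistent = consistency_score(make_stub("cat", {"cat"}), "What animal is shown?", options)
inconsistent = consistency_score(make_stub("cat", {"cat", "dog"}), "What animal is shown?", options)
```

With these stubs, the first model confirms only its own pick and scores 1.0, while the second also says "yes" to a rival option and scores 2/3. Neither score says anything about whether "cat" is actually correct, which is the sense in which such a check needs no annotation.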
