Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models
Yangyi Chen, Karan Sikka, Michael Cogswell, Heng Ji, Ajay Divakaran

TL;DR
This paper introduces the CURE benchmark to evaluate the reasoning performance and consistency of vision-language models, and a two-stage training framework to improve both. The evaluation shows that existing models lack strong reasoning consistency.
Contribution
The paper develops a cost-effective LLM-Human-in-the-Loop pipeline for dataset construction, uses it to create the CURE benchmark for reasoning evaluation, and proposes a two-stage training framework to improve VLM reasoning performance and consistency.
Findings
Existing VLMs lack strong reasoning consistency.
The CURE benchmark effectively measures reasoning performance.
The proposed training framework improves reasoning accuracy and consistency.
Abstract
Vision-language models (VLMs) have recently demonstrated strong efficacy as visual assistants that can parse natural queries about the visual content and generate human-like outputs. In this work, we explore the ability of these models to demonstrate human-like reasoning based on the perceived information. To address a crucial concern regarding the extent to which their reasoning capabilities are fully consistent and grounded, we also measure the reasoning consistency of these models. We achieve this by proposing a chain-of-thought (CoT) based consistency measure. However, such an evaluation requires a benchmark that encompasses both high-level inference and detailed reasoning chains, which is costly. We tackle this challenge by proposing an LLM-Human-in-the-Loop pipeline, which notably reduces cost while simultaneously ensuring the generation of a high-quality dataset. Based on this…
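The abstract is truncated here and does not spell out the consistency measure, so the following is only a minimal sketch of one plausible way to score consistency between a high-level inference and its reasoning chain. It assumes each benchmark item pairs one high-level question with several chain-of-thought subquestions, and that a per-question correctness flag is already available. The names `Example`, `forward_consistency`, and `backward_consistency` are hypothetical and may not match the paper's exact definitions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Example:
    # Hypothetical record: whether the model got the high-level inference right,
    # and one correctness flag per subquestion in the reasoning chain.
    high_level_correct: bool
    subquestions_correct: List[bool]


def forward_consistency(examples: List[Example]) -> Optional[float]:
    """Among items whose subquestions are all answered correctly,
    the fraction whose high-level inference is also correct."""
    eligible = [e for e in examples if all(e.subquestions_correct)]
    if not eligible:
        return None  # undefined when no item qualifies
    return sum(e.high_level_correct for e in eligible) / len(eligible)


def backward_consistency(examples: List[Example]) -> Optional[float]:
    """Among items whose high-level inference is correct,
    the fraction whose subquestions are all answered correctly."""
    eligible = [e for e in examples if e.high_level_correct]
    if not eligible:
        return None  # undefined when no item qualifies
    return sum(all(e.subquestions_correct) for e in eligible) / len(eligible)


examples = [
    Example(True, [True, True]),     # consistent: chain and conclusion both right
    Example(False, [True, True]),    # right chain, wrong conclusion
    Example(True, [True, False]),    # right conclusion, broken chain
    Example(False, [False, False]),  # wrong throughout
]
print(forward_consistency(examples), backward_consistency(examples))
```

Splitting the score into two conditional rates separates two failure modes: a model that follows the chain but misses the conclusion, and a model that reaches the conclusion without a sound chain.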
Taxonomy
Topics: Multimodal Machine Learning Applications · Visual Attention and Saliency Detection · Domain Adaptation and Few-Shot Learning
