Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models

Yangyi Chen; Karan Sikka; Michael Cogswell; Heng Ji; Ajay Divakaran

arXiv:2309.04461 · cs.CL · March 21, 2024 · 1 citation

TL;DR

This paper introduces the CURE benchmark and a two-stage training framework to evaluate and enhance the reasoning consistency and performance of vision-language models, revealing current models' limitations and proposing improvements.

Contribution

The paper develops a cost-effective LLM-Human-in-the-Loop pipeline, creates the CURE benchmark for reasoning evaluation, and proposes a novel training framework to improve VLM reasoning and consistency.

Findings

1. Existing VLMs lack strong reasoning consistency.
2. The CURE benchmark effectively measures reasoning performance.
3. The proposed training framework improves reasoning accuracy and consistency.

Abstract

Vision-language models (VLMs) have recently demonstrated strong efficacy as visual assistants that can parse natural queries about the visual content and generate human-like outputs. In this work, we explore the ability of these models to demonstrate human-like reasoning based on the perceived information. To address a crucial concern regarding the extent to which their reasoning capabilities are fully consistent and grounded, we also measure the reasoning consistency of these models. We achieve this by proposing a chain-of-thought (CoT) based consistency measure. However, such an evaluation requires a benchmark that encompasses both high-level inference and detailed reasoning chains, which is costly. We tackle this challenge by proposing an LLM-Human-in-the-Loop pipeline, which notably reduces cost while simultaneously ensuring the generation of a high-quality dataset. Based on this…

Code & Models

Repositories

yangyi-chen/cotconsistency (Official)

Videos

Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models · underline

Taxonomy

Topics: Multimodal Machine Learning Applications · Visual Attention and Saliency Detection · Domain Adaptation and Few-Shot Learning