The Role of Chain-of-Thought in Complex Vision-Language Reasoning Task
Yifan Wu, Pengchuan Zhang, Wenhan Xiong, Barlas Oguz, James C. Gee, Yixin Nie

TL;DR
This paper investigates how the Chain-of-Thought reasoning approach enhances complex vision-language tasks by introducing a 'Description then Decision' strategy that significantly improves performance.
Contribution
It introduces a novel 'Description then Decision' strategy inspired by human processing, demonstrating substantial performance improvements in vision-language reasoning tasks.
Findings
The new strategy improves performance on probing tasks by 50%.
The approach effectively decomposes complex reasoning in vision-language tasks.
Lays groundwork for future reasoning paradigm research.
Abstract
The study explores the effectiveness of the Chain-of-Thought approach, known for its proficiency in language tasks by breaking them down into sub-tasks and intermediate steps, in improving vision-language tasks that demand sophisticated perception and reasoning. We present the "Description then Decision" strategy, which is inspired by how humans process signals. This strategy significantly improves probing task performance by 50%, establishing the groundwork for future research on reasoning paradigms in complex vision-language tasks.
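The two-stage idea behind "Description then Decision" can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the prompt wording, and the choice to give the decision stage only the text description are all assumptions made for illustration, and the stub functions stand in for real model calls.

```python
from typing import Callable

# Hypothetical interfaces (assumptions, not from the paper):
#   Describer: maps raw image bytes to a text description.
#   Decider:   maps a text prompt to a text answer.
Describer = Callable[[bytes], str]
Decider = Callable[[str], str]


def description_then_decision(
    describe: Describer, decide: Decider, image: bytes, question: str
) -> str:
    """Answer a question about an image in two separate steps."""
    # Stage 1 (description): turn the visual input into text.
    description = describe(image)
    # Stage 2 (decision): reason over the description to answer the question.
    prompt = f"Description: {description}\nQuestion: {question}\nAnswer:"
    return decide(prompt)


def stub_describe(image: bytes) -> str:
    """Toy stand-in for a vision-language model's description step."""
    return "a dog chasing a ball"


def stub_decide(prompt: str) -> str:
    """Toy stand-in for the decision step: a simple keyword check."""
    return "yes" if "dog" in prompt.split("Question:")[0] else "no"


answer = description_then_decision(
    stub_describe, stub_decide, b"", "Is there a dog in the image?"
)
```

Separating the two callables keeps perception and reasoning independently replaceable, which mirrors the decomposition the abstract describes.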
Taxonomy
Topics: Cognitive Science and Mapping · Categorization, perception, and language
