LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

Guowei Xu; Peng Jin; Ziang Wu; Hao Li; Yibing Song; Lichao Sun; Li Yuan

arXiv:2411.10440 · cs.CV · July 22, 2025 · 3 cites

TL;DR

LLaVA-CoT introduces a structured, multistage reasoning approach for vision-language models. Using only 100k training samples together with test-time scaling, it significantly improves performance on complex visual reasoning tasks.

Contribution

The paper makes three contributions: LLaVA-CoT, a vision-language model that performs autonomous multistage reasoning; LLaVA-CoT-100k, a dataset with structured reasoning annotations; and a test-time stage-wise retracing search method. The resulting model outperforms larger models.

Findings

01 Achieves a 9.4% improvement on reasoning benchmarks

02 Outperforms larger and closed-source models like GPT-4o-mini

03 Effective with only 100k training samples

Abstract

Large language models have demonstrated substantial advancements in reasoning capabilities. However, current Vision-Language Models (VLMs) often struggle to perform systematic and structured reasoning, especially when handling complex visual question-answering tasks. In this work, we introduce LLaVA-CoT, a large VLM designed to conduct autonomous multistage reasoning. Unlike chain-of-thought prompting, LLaVA-CoT independently engages in sequential stages of summarization, visual interpretation, logical reasoning, and conclusion generation. This structured approach enables LLaVA-CoT to achieve marked improvements on reasoning-intensive tasks. To accomplish this, we construct the LLaVA-CoT-100k dataset, integrating samples from various visual question answering sources and providing structured reasoning annotations. Besides, we propose a test-time stage-wise retracing search method…
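The four sequential stages named in the abstract (summarization, visual interpretation, logical reasoning, conclusion) can be illustrated with a small parser. This is a minimal sketch, not the authors' code: it assumes the model wraps each stage in a tag pair, and the tag names (`SUMMARY`, `CAPTION`, `REASONING`, `CONCLUSION`), the function `parse_stages`, and the example response are all hypothetical choices made for illustration.

```python
import re

# Stage order follows the abstract; the tag names and syntax are assumptions.
STAGES = ("SUMMARY", "CAPTION", "REASONING", "CONCLUSION")


def parse_stages(response: str) -> dict[str, str]:
    """Split a staged response into its four parts, enforcing stage order."""
    stages = {}
    position = 0  # each stage must start after the previous one ended
    for name in STAGES:
        pattern = re.compile(rf"<{name}>(.*?)</{name}>", re.DOTALL)
        match = pattern.search(response, position)
        if match is None:
            raise ValueError(f"missing or out-of-order stage: {name}")
        stages[name] = match.group(1).strip()
        position = match.end()
    return stages


# Hypothetical model output, written by hand for this example.
example = (
    "<SUMMARY>Count the apples left after two are removed.</SUMMARY>"
    "<CAPTION>The image shows five apples on a table.</CAPTION>"
    "<REASONING>Five apples minus two leaves three.</REASONING>"
    "<CONCLUSION>3</CONCLUSION>"
)
parsed = parse_stages(example)
print(parsed["CONCLUSION"])
```

Keeping the stages as separate fields is what would let a caller show only the conclusion to a user, or re-sample a single stage at test time, without re-parsing free-form text.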

Taxonomy

Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Balanced Selection