Improve Vision Language Model Chain-of-thought Reasoning
Ruohong Zhang, Bowen Zhang, Yanghao Li, Haotian Zhang, Zhiqing Sun, Zhe Gan, Yinfei Yang, Ruoming Pang, and Yiming Yang

TL;DR
This paper enhances vision language models' reasoning by enriching training data with GPT-4o rationales and applying reinforcement learning, leading to improved interpretability and generalization in reasoning tasks.
Contribution
It introduces a method to incorporate detailed rationales and reinforcement learning to significantly improve CoT reasoning in vision language models.
Findings
Improved CoT reasoning performance on benchmark datasets
Enhanced generalization to direct answer prediction
Effective use of GPT-4o generated rationales and reinforcement learning
Abstract
Chain-of-thought (CoT) reasoning in vision language models (VLMs) is crucial for improving interpretability and trustworthiness. However, current training recipes lack robust CoT reasoning data, relying on datasets dominated by short annotations with minimal rationales. In this work, we show that training VLMs on short answers does not generalize well to reasoning tasks that require more detailed responses. To address this, we propose a two-fold approach. First, we distill rationales from the GPT-4o model to enrich the training data and fine-tune VLMs, boosting their CoT performance. Second, we apply reinforcement learning to further calibrate reasoning quality. Specifically, we construct positive (correct) and negative (incorrect) pairs of model-generated reasoning chains by comparing their predictions with annotated short answers. Using this pairwise data, we apply the Direct Preference…
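The pair-construction step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Answer:` marker, the exact-match normalization, the function names `extract_answer`, `build_preference_pairs`, and `dpo_loss`, and the `beta` and `max_pairs` values are all assumptions made for the example. The loss function is the standard per-pair DPO objective, included only to show how such pairs would be consumed.

```python
import math
import re
from itertools import product


def extract_answer(chain: str) -> str:
    """Return the normalized text after the last 'Answer:' marker.

    The 'Answer:' convention is a hypothetical output format for this sketch.
    """
    matches = re.findall(r"answer:\s*(.+)", chain, flags=re.IGNORECASE)
    return matches[-1].strip().lower().rstrip(".") if matches else ""


def build_preference_pairs(question, chains, gold_answer, max_pairs=4):
    """Pair correct chains (chosen) with incorrect chains (rejected).

    A chain counts as correct when its extracted answer exactly matches the
    annotated short answer after simple normalization (an assumption).
    """
    gold = gold_answer.strip().lower().rstrip(".")
    correct = [c for c in chains if extract_answer(c) == gold]
    incorrect = [c for c in chains if extract_answer(c) != gold]
    pairs = [
        {"prompt": question, "chosen": pos, "rejected": neg}
        for pos, neg in product(correct, incorrect)
    ]
    return pairs[:max_pairs]


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (policy margin - reference margin)))."""
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))


if __name__ == "__main__":
    sampled = [
        "3 apples plus 2 pears makes 5. Answer: 5",
        "I count 6 items in the image. Answer: 6.",
    ]
    for pair in build_preference_pairs("How many fruits are shown?", sampled, "5"):
        print(pair["chosen"], "|", pair["rejected"])
```

Questions for which every sampled chain is correct, or every chain is incorrect, yield no pairs under this scheme, so only questions where the model is inconsistent contribute preference data.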
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
- The dataset of CoTs is likely to be useful in future work
- The results show that finetuning on CoTs is helpful in several tasks
- The results show marginal improvements from the DPO process
- There are several ablations on dataset composition and methods of training/evaluating the model, which are helpful
- The improvements from DPO (in Tables 6, 7 and Figure 7) are quite small. (Are they statistically significant?)
- The related work section does not discuss the mentioned work in enough detail (and could probably benefit from inclusion of more related work). For instance, what are the differences between the method in this paper and methods in the prior work that finetunes VLMs on CoTs? The statement that "Shao et al. (2024) trains VLMs for chain-of-thought (CoT) reasoning in object localization
A clear two-stage approach is proposed to improve the chain-of-thought reasoning and the final performance of vision-language models. The overall research goal and method are easy to understand.
1. The novelty in the proposed two-stage approach is quite limited, and similar strategies have been widely adopted in previous work. For the first stage, which distills CoT reasoning chains from the teacher model GPT-4o, this kind of method has been implemented too many times in different domains, different modalities, and for different purposes. For the second stage, which prompts the VLM to first generate the CoT solution and then check the correctness of the final answers by comparing with the ground-tr
1. The overall design makes sense. As demonstrated by the benchmark results, this could be a possible choice for implementing vision-language applications.
2. The release of synthetic data generated by GPT-4o contributes to the VLM finetuning.
3. The failure analysis in this paper is insightful and potentially helpful for other VLM research.
The novelty is unclear.
- From the idea level, tuning for CoT ability is being actively explored in LLM-centric research (e.g., [1], [2], [3])
- For the overall design, the proposed method has recently been widely used, including:
  - Step 1 directly leverages the commonly seen approach, i.e., knowledge distillation from a larger teacher model (GPT-4). (e.g., [4])
  - Two-stage tuning (SFT-RL) has been shown to be effective in improving reasoning ability in many LLM works, either through
- The paper is well-written and thus easy to understand.
- The motivation to equip VLMs with CoT capability is interesting.
- The two-stage method demonstrates improved accuracy across several ImageQA benchmarks.
[Setting]
1. Lack of discussion on the reasons why the VLM lacks CoT reasoning capabilities, even when its LLM backbone possesses CoT capabilities. If the LLM backbone is already equipped with CoT capability, why does the VLM not retain this capability? On the other hand, if the LLM backbone itself lacks CoT capability, it raises the question of why improvements were not made directly to the CoT capabilities of the LLM.
2. Does llama3-8b have CoT capabilities (I ask this question because your b
Taxonomy
Topics: Semantic Web and Ontologies
