Bootstrapping Grounded Chain-of-Thought in Multimodal LLMs for Data-Efficient Model Adaptation

Jiaer Xia; Bingkui Tong; Yuhang Zang; Rui Shao; Kaiyang Zhou

arXiv:2507.02859·cs.CV·July 4, 2025

TL;DR

This paper introduces Grounded Chain-of-Thought (GCoT), a bootstrapping method that grounds reasoning steps in the input image to improve how multimodal large language models (MLLMs) adapt to specialized vision tasks, especially when training data is limited.

Contribution

The paper proposes GCoT, a novel bootstrapping approach that injects grounding information into CoT data to enhance model adaptation to specialized vision tasks with limited data.
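
As a rough illustration of what "injecting grounding information into CoT data" could look like, the sketch below pairs each reasoning step with an optional image region and serializes the result into a single training target. It assumes grounding is expressed as one bounding box per step; the box format, the `ReasoningStep` and `render_grounded_cot` names, the inline `[box: ...]` syntax, and the chart values are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical box format: (x1, y1, x2, y2) in pixel coordinates.
Box = Tuple[int, int, int, int]


@dataclass
class ReasoningStep:
    text: str                  # one chain-of-thought step in natural language
    box: Optional[Box] = None  # image region the step refers to, if any


def render_grounded_cot(steps: List[ReasoningStep], answer: str) -> str:
    """Serialize steps into one training target, appending each box inline.

    The "[box: ...]" syntax is an assumption made for this sketch.
    """
    lines = []
    for i, step in enumerate(steps, start=1):
        line = f"Step {i}: {step.text}"
        if step.box is not None:
            x1, y1, x2, y2 = step.box
            line += f" [box: {x1}, {y1}, {x2}, {y2}]"
        lines.append(line)
    lines.append(f"Answer: {answer}")
    return "\n".join(lines)


# Made-up chart example: two grounded reading steps, one ungrounded arithmetic step.
steps = [
    ReasoningStep("The bar labeled 2020 reaches 45.", box=(120, 80, 160, 300)),
    ReasoningStep("The bar labeled 2021 reaches 60.", box=(180, 40, 220, 300)),
    ReasoningStep("The difference is 60 - 45 = 15."),
]
target = render_grounded_cot(steps, "15")
print(target)
```

Keeping the box optional lets steps that read a value from the image carry a region while purely arithmetic steps stay ungrounded; whether the paper draws that distinction is not stated in this summary.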

Findings

01. GCoT significantly improves performance in data-limited regimes.

02. Grounding the reasoning steps makes them more faithful to the input images.

03. The approach outperforms fine-tuning and distillation methods.

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in interpreting images using natural language. However, without using large-scale datasets for retraining, these models are difficult to adapt to specialized vision tasks, e.g., chart understanding. This problem is caused by a mismatch between pre-training and downstream datasets: pre-training datasets primarily concentrate on scenes and objects but contain limited information about specialized, non-object images, such as charts and tables. In this paper, we share an interesting finding that training an MLLM with Chain-of-Thought (CoT) reasoning data can facilitate model adaptation in specialized vision tasks, especially under data-limited regimes. However, we identify a critical issue within CoT data distilled from pre-trained MLLMs, i.e., the data often contains multiple factual errors in the reasoning…
