Bootstrapping Grounded Chain-of-Thought in Multimodal LLMs for Data-Efficient Model Adaptation
Jiaer Xia, Bingkui Tong, Yuhang Zang, Rui Shao, Kaiyang Zhou

TL;DR
This paper introduces Grounded Chain-of-Thought (GCoT), a bootstrapping method that grounds reasoning steps in the input image to improve how multimodal language models adapt to specialized vision tasks, particularly when training data is limited.
Contribution
The paper proposes GCoT, a novel bootstrapping approach that injects grounding information into CoT data to enhance model adaptation to specialized vision tasks with limited data.
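To make "injecting grounding information into CoT data" concrete, here is a minimal sketch of what a grounded reasoning record might look like. The `GroundedStep` class, the `format_gcot` helper, the pixel bounding-box convention, and the output string format are all assumptions made for illustration. The summary above does not specify the paper's actual data format.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical box convention for illustration: (x1, y1, x2, y2) in pixels.
Box = Tuple[int, int, int, int]


@dataclass
class GroundedStep:
    """One reasoning step paired with the image region it refers to (assumed schema)."""
    text: str
    box: Box


def format_gcot(steps: List[GroundedStep], answer: str) -> str:
    """Serialize grounded steps plus the final answer into one training target string."""
    lines = [f"{s.text} [{s.box[0]}, {s.box[1]}, {s.box[2]}, {s.box[3]}]" for s in steps]
    lines.append(f"Answer: {answer}")
    return "\n".join(lines)


# Made-up chart example: each step cites the region that supports it.
steps = [
    GroundedStep("The bar for 2019 reads 42.", (120, 80, 160, 300)),
    GroundedStep("The bar for 2020 reads 58.", (180, 40, 220, 300)),
]
target = format_gcot(steps, "16")
print(target)
```

Pairing each step with a region is one way to make a reasoning chain checkable against the image: a step whose cited region does not contain the stated value can be flagged or corrected before the data is used for training.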
Findings
GCoT significantly improves performance in data-limited regimes.
Grounding reasoning steps increases faithfulness to input images.
The approach outperforms fine-tuning and distillation methods.
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in interpreting images using natural language. However, without using large-scale datasets for retraining, these models are difficult to adapt to specialized vision tasks, e.g., chart understanding. This problem is caused by a mismatch between pre-training and downstream datasets: pre-training datasets primarily concentrate on scenes and objects but contain limited information about specialized, non-object images, such as charts and tables. In this paper, we share an interesting finding that training an MLLM with Chain-of-Thought (CoT) reasoning data can facilitate model adaptation in specialized vision tasks, especially under data-limited regimes. However, we identify a critical issue within CoT data distilled from pre-trained MLLMs, i.e., the data often contains multiple factual errors in the reasoning…
