Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning

Ang Li; Charles Wang; Deqing Fu; Kaiyu Yue; Zikui Cai; Wang Bill Zhu; Ollie Liu; Peng Guo; Willie Neiswanger; Furong Huang; Tom Goldstein; Micah Goldblum

arXiv:2507.16746·cs.CV·October 10, 2025



TL;DR

Zebra-CoT introduces a large-scale dataset with interleaved text-image reasoning traces to improve multimodal reasoning in AI models, addressing current performance and data scarcity challenges.

Contribution

The paper presents Zebra-CoT, a diverse dataset of 182,384 samples for training visual chain-of-thought models. Fine-tuning on it yields +12% test-set accuracy and gains of up to +13% on VLM benchmarks.

Findings

01. +12% accuracy on test set after fine-tuning

02. Up to +13% gain on VLM benchmarks

03. High-quality visual reasoning chains generated

Abstract

Humans often use visual aids, for example diagrams or sketches, when solving complex problems. Training multimodal models to do the same, known as Visual Chain of Thought (Visual CoT), is challenging due to: (1) poor off-the-shelf visual CoT performance, which hinders reinforcement learning, and (2) the lack of high-quality visual CoT training data. We introduce Zebra-CoT, a diverse large-scale dataset with 182,384 samples, containing logically coherent interleaved text-image reasoning traces. We focus on four categories of tasks where sketching or visual reasoning is especially natural, spanning scientific questions such as geometry, physics, and algorithms; 2D visual reasoning tasks like visual search and jigsaw puzzles; 3D reasoning tasks including 3D multi-hop inference, embodied and robot planning; visual logic problems and strategic games like chess. Fine-tuning the…
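
To make the phrase "interleaved text-image reasoning traces" concrete, here is a minimal Python sketch of how one such trace could be represented. This is an illustration only and not the dataset's actual schema: the class names `Step` and `Trace`, their fields, the helper methods, and the toy geometry example (including the image file name) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Literal, Optional, Tuple


@dataclass
class Step:
    """One reasoning step: either a textual thought or a visual aid."""
    kind: Literal["text", "image"]
    # For "text": the thought itself. For "image": a hypothetical file
    # name standing in for a sketch or diagram.
    content: str


@dataclass
class Trace:
    """A question, an ordered list of text/image steps, and a final answer."""
    question: str
    steps: List[Step] = field(default_factory=list)
    answer: Optional[str] = None

    def is_interleaved(self) -> bool:
        # True only when the trace contains both modalities.
        return {s.kind for s in self.steps} == {"text", "image"}

    def modality_runs(self) -> List[Tuple[str, int]]:
        # Collapse consecutive steps of the same modality into (kind, count).
        runs: List[Tuple[str, int]] = []
        for s in self.steps:
            if runs and runs[-1][0] == s.kind:
                runs[-1] = (s.kind, runs[-1][1] + 1)
            else:
                runs.append((s.kind, 1))
        return runs


# Hypothetical toy example, not a sample taken from Zebra-CoT.
example = Trace(
    question="What is the area of a right triangle with legs 3 and 4?",
    steps=[
        Step("text", "Sketch the triangle and label the two legs."),
        Step("image", "triangle_sketch.png"),
        Step("text", "The legs are base and height: area = 0.5 * 3 * 4."),
    ],
    answer="6",
)
```

Keeping the steps in a single ordered list, rather than separate text and image lists, preserves where each visual aid occurs in the reasoning, which is the property that distinguishes an interleaved trace from a text-only chain of thought with attached figures.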


Code & Models


Datasets

multimodal-reasoning-lab/Zebra-CoT
dataset · 5.4k downloads


Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques