FlipVQA: Scaling Multi-modal Instruction Tuning via Textbook-to-Knowledge Synthesis

Zhen Hao Wong; Jingwen Deng; Yuzhao Wang; Wenkai Yu; Jihao Huang; Runming He; Chengyu Shen; Hao Liang; Wentao Zhang

arXiv:2511.16216·cs.AI·March 31, 2026

TL;DR

FlipVQA introduces an automated pipeline to extract and curate large-scale question-answering datasets from textbooks, significantly reducing costs while maintaining high data quality for multi-modal reasoning tasks.

Contribution

The paper presents FlipVQA-Miner, a novel method for extracting structured QA and VQA pairs from complex textbook layouts, enabling scalable, high-fidelity data generation.

Findings

01. Constructed the FlipVQA-83K dataset with 83,000 QA pairs across 11 disciplines
02. Achieved 50x cost savings compared to manual annotation
03. Models trained on FlipVQA-83K show improved reasoning and generalization

Abstract

Textbooks are among the richest repositories of human-verified reasoning knowledge, yet their complex layouts, which combine multi-column typesetting, cross-page question-answer separation, and interleaved figures, make automated extraction of structured QA and VQA pairs extremely challenging. Existing alternatives either synthesize data from scratch, which lacks authentic problem contexts, or rely on costly expert annotation that cannot scale. We propose FlipVQA-Miner, an automated pipeline that resolves long-range logical dependencies and cross-page discontinuities in OCR-parsed documents, recovering coherent question-answer-figure associations even when answers reside in separate companion volumes. A subsequent multi-stage curation pipeline transforms these raw extractions into AI-ready supervision signals. Using FlipVQA-Miner, we construct FlipVQA-83K, comprising…

Code & Models

Repositories

OpenDCAI/DataFlow-VQA
github

Datasets

OpenDCAI/FlipVQA
dataset· 323 dl
