FlipVQA: Scaling Multi-modal Instruction Tuning via Textbook-to-Knowledge Synthesis
Zhen Hao Wong, Jingwen Deng, Yuzhao Wang, Wenkai Yu, Jihao Huang, Runming He, Chengyu Shen, Hao Liang, Wentao Zhang

TL;DR
FlipVQA introduces an automated pipeline to extract and curate large-scale question-answering datasets from textbooks, significantly reducing costs while maintaining high data quality for multi-modal reasoning tasks.
Contribution
The paper presents FlipVQA-Miner, a novel method for extracting structured QA and VQA pairs from complex textbook layouts, enabling scalable, high-fidelity data generation.
Findings
Constructed FlipVQA-83K dataset with 83,000 QA pairs across 11 disciplines
Achieved 50x cost savings compared to manual annotation
Models trained on FlipVQA-83K show improved reasoning and generalization
Abstract
Textbooks are among the richest repositories of human-verified reasoning knowledge, yet their complex layouts, which include multi-column typesetting, cross-page question-answer separation, and interleaved figures, make automated extraction of structured QA and VQA pairs extremely challenging. Existing alternatives either synthesize data from scratch, which lacks authentic problem contexts, or rely on costly expert annotation that cannot scale. We propose FlipVQA-Miner, an automated pipeline that resolves long-range logical dependencies and cross-page discontinuities in OCR-parsed documents, recovering coherent question-answer-figure associations even when answers reside in separate companion volumes. A subsequent multi-stage curation pipeline transforms these raw extractions into AI-ready supervision signals. Using FlipVQA-Miner, we construct FlipVQA-83K, comprising…
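To make the cross-page and companion-volume pairing problem concrete, here is a minimal sketch of one way it could be approached. It is not the FlipVQA-Miner implementation, which the abstract does not detail. The `Block` schema, the `source` labels, the "Chapter N" heading pattern, and the numbered-item pattern are all assumptions chosen for illustration: the sketch indexes questions and answers by a (chapter, item number) key, merges blocks that continue an item onto the next page, and keeps only the keys present on both sides.

```python
import re
from dataclasses import dataclass


@dataclass
class Block:
    """One OCR-parsed text block (hypothetical schema, not from the paper)."""
    source: str  # assumed labels: "questions" (main text) or "answers" (companion volume)
    page: int
    text: str


# Assumed layout conventions: "Chapter 3" headings and "12. ..." numbered items.
CHAPTER_RE = re.compile(r"^Chapter\s+(\d+)", re.IGNORECASE)
ITEM_RE = re.compile(r"^(\d+)[.)]\s+(.*)")


def index_items(blocks, source):
    """Map (chapter, item number) -> text for one source, assuming reading order."""
    items = {}
    chapter = None
    last_key = None
    for block in blocks:
        if block.source != source:
            continue
        heading = CHAPTER_RE.match(block.text)
        if heading:
            chapter = int(heading.group(1))
            last_key = None
            continue
        item = ITEM_RE.match(block.text)
        if item and chapter is not None:
            last_key = (chapter, int(item.group(1)))
            items[last_key] = item.group(2).strip()
        elif last_key is not None:
            # Unnumbered block: treat it as a continuation of the previous
            # item, which may have spilled onto the next page.
            items[last_key] += " " + block.text.strip()
    return items


def pair_qa(blocks):
    """Join questions and answers that share a (chapter, number) key."""
    questions = index_items(blocks, "questions")
    answers = index_items(blocks, "answers")
    return [
        {"chapter": c, "number": n, "question": q, "answer": answers[(c, n)]}
        for (c, n), q in sorted(questions.items())
        if (c, n) in answers  # drop questions with no recovered answer
    ]
```

A key-based join like this only works when numbering survives OCR intact. Damaged numbering, figure references, and layout-dependent ordering are the harder cases the abstract alludes to, and they are out of scope for this sketch.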
