Closing the Gap: Data-Centric Fine-Tuning of Vision Language Models for the Standardized Exam Questions
Egemen Sert, Şeyda Ertekin

TL;DR
This paper demonstrates that high-quality, curriculum-aligned multimodal data and optimized reasoning syntax can significantly improve vision-language models' performance on standardized exam questions, rivaling proprietary approaches.
Contribution
It introduces a large, curated multimodal dataset and an optimized fine-tuning approach that together enhance vision-language reasoning on exam questions, emphasizing data quality and syntax.
Findings
Achieved 78.6% accuracy on the YKSUniform benchmark
Data composition and syntax are crucial for multimodal reasoning
Supervised fine-tuning with curated data rivals proprietary models
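The accuracy figure above is the share of benchmark questions answered correctly. As a minimal sketch of how such a score might be computed for multiple-choice answers, assuming each answer is a single option letter: the function name `accuracy` and the toy answer lists are hypothetical and are not taken from the paper or from YKSUniform.

```python
def accuracy(predictions, gold):
    """Return the percentage of predictions that match the gold answers."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot score an empty benchmark")
    # Compare option letters case-insensitively, ignoring surrounding whitespace.
    correct = sum(
        p.strip().upper() == g.strip().upper()
        for p, g in zip(predictions, gold)
    )
    return 100.0 * correct / len(gold)


# Hypothetical toy data: five multiple-choice questions (not real benchmark items).
gold_answers = ["A", "C", "E", "B", "D"]
model_answers = ["A", "C", "D", "B", "d"]

score = accuracy(model_answers, gold_answers)
print(f"{score:.1f}%")
```

Exact-match scoring on option letters is only one possible choice; a real evaluation would also need a step that extracts the final answer from the model's free-form reasoning.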
Abstract
Multimodal reasoning has become a cornerstone of modern AI research. Standardized exam questions offer a uniquely rigorous testbed for such reasoning, providing structured visual contexts and verifiable answers. While recent progress has largely focused on algorithmic advances such as reinforcement learning (e.g., GRPO, DPO), the data-centric foundations of vision-language reasoning remain less explored. We show that supervised fine-tuning (SFT) with high-quality data can rival proprietary approaches. To this end, we compile a 161.4-million-token multimodal dataset combining textbook question-solution pairs, curriculum-aligned diagrams, and contextual materials, and fine-tune Qwen-2.5VL-32B using an optimized reasoning syntax (QMSA). The resulting model achieves 78.6% accuracy, only 1.0% below Gemini 2.0 Flash, on our newly released benchmark YKSUniform, which standardizes 1,854…
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
