Closing the Gap: Data-Centric Fine-Tuning of Vision Language Models for the Standardized Exam Questions

Egemen Sert; Şeyda Ertekin

arXiv:2512.00042·cs.CV·December 2, 2025

Open Access

TL;DR

This paper demonstrates that high-quality, curriculum-aligned multimodal data and optimized reasoning syntax can significantly improve vision-language models' performance on standardized exam questions, rivaling proprietary approaches.

Contribution

The paper introduces a curated 161.4-million-token multimodal dataset and a fine-tuning approach built on an optimized reasoning syntax (QMSA). Together they improve vision-language reasoning on exam questions, with the emphasis on data quality and syntax rather than algorithmic changes.

Findings

01

Achieved 78.6% accuracy on the YKSUniform benchmark

02

Data composition and reasoning syntax are crucial for multimodal reasoning

03

Supervised fine-tuning with curated data rivals proprietary models

Abstract

Multimodal reasoning has become a cornerstone of modern AI research. Standardized exam questions offer a uniquely rigorous testbed for such reasoning, providing structured visual contexts and verifiable answers. While recent progress has largely focused on algorithmic advances such as reinforcement learning (e.g., GRPO, DPO), the data-centric foundations of vision-language reasoning remain less explored. We show that supervised fine-tuning (SFT) with high-quality data can rival proprietary approaches. To this end, we compile a 161.4-million-token multimodal dataset combining textbook question-solution pairs, curriculum-aligned diagrams, and contextual materials, and fine-tune Qwen-2.5VL-32B using an optimized reasoning syntax (QMSA). The resulting model achieves 78.6% accuracy, only 1.0% below Gemini 2.0 Flash, on our newly released benchmark YKSUniform, which standardizes 1,854…

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques