Data Organization Matters in Multimodal Instruction Tuning: A Controlled Study of Capability Trade-offs
Guowei Tang

TL;DR
This study investigates how the order of data presentation during training influences the capabilities of multimodal large language models, revealing that curriculum strategies improve structured reasoning and overall performance.
Contribution
The paper introduces a controlled three-stage training framework that keeps the backbone, trainable modules, and optimization pipeline fixed, so that the effect of data organization on multimodal model capabilities can be analyzed in isolation.
Findings
Curriculum training yields the best overall trade-off in capabilities.
Balanced sampling enhances OCR-oriented skills but reduces broader capabilities.
Reverse curriculum lags the other strategies in both performance and stability.
Abstract
Recent multimodal large language models (MLLMs) perform strongly on general visual understanding, diagram and chart reasoning, and document-centric perception. However, these abilities are learned from heterogeneous supervision sources with very different task structures and learning demands, and the effect of their temporal organization during training remains underexplored. We study whether data organization affects the trade-off among general understanding, structured reasoning, and fine-grained OCR/document understanding in multimodal instruction tuning. To isolate this factor, we use a controlled three-stage training framework in which the backbone, trainable modules, and optimization pipeline are fixed across all runs, and only the temporal arrangement of post-alignment supervision is changed. We compare four strategies: direct mixture, curriculum training, balanced sampling, and…
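To make the comparison concrete, the following is a minimal sketch of how the data-ordering strategies named in this summary (direct mixture, curriculum, balanced sampling, and the reverse curriculum mentioned under Findings) could arrange the same supervision pools into different training orders. It is not the author's implementation. The pool names, the assumed stage order in `ASSUMED_ORDER`, the oversampling rule used for balanced sampling, and the function `build_schedule` are all hypothetical and chosen only for illustration.

```python
import random
from typing import Dict, List, Sequence

# Assumed easy-to-hard ordering of supervision sources (hypothetical;
# the actual stage order used in the paper is not stated in this summary).
ASSUMED_ORDER = ("general", "reasoning", "ocr")


def build_schedule(
    pools: Dict[str, Sequence[str]],
    strategy: str,
    order: Sequence[str] = ASSUMED_ORDER,
    seed: int = 0,
) -> List[str]:
    """Return one training order over the same pools; only the arrangement changes."""
    rng = random.Random(seed)
    # Shuffle within each source so only the between-source arrangement differs.
    shuffled = {name: rng.sample(list(pools[name]), len(pools[name])) for name in order}

    if strategy == "direct_mixture":
        # All sources merged and shuffled together.
        merged = [x for name in order for x in shuffled[name]]
        rng.shuffle(merged)
        return merged
    if strategy == "curriculum":
        # Sources presented one after another in the assumed order.
        return [x for name in order for x in shuffled[name]]
    if strategy == "reverse_curriculum":
        # Same blocks, opposite order.
        return [x for name in reversed(order) for x in shuffled[name]]
    if strategy == "balanced":
        # Assumption: smaller sources are oversampled (cycled) so that every
        # non-empty source contributes equally, interleaved round-robin.
        target = max((len(v) for v in shuffled.values()), default=0)
        schedule = []
        for i in range(target):
            for name in order:
                if shuffled[name]:
                    schedule.append(shuffled[name][i % len(shuffled[name])])
        return schedule
    raise ValueError(f"unknown strategy: {strategy}")
```

For example, with pools of sizes 4, 2, and 1, `build_schedule(pools, "curriculum")` returns 7 samples grouped by source, while `"balanced"` returns 12 samples in which each source appears 4 times. Keeping the within-source shuffle identical across strategies mirrors the idea of a controlled comparison: the pools are the same in every run and only their temporal arrangement (and, for balanced sampling, their sampling frequency) changes.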
