Revisiting the Data Sampling in Multimodal Post-training from a Difficulty-Distinguish View
Jianyu Qi, Ding Zou, Wenrui Yan, Rui Ma, Jiaxu Li, Zhijie Zheng, Zhiguo Yang, Rongchang Zhao

TL;DR
This paper introduces difficulty-aware sampling strategies for multimodal post-training of large language models, improving reasoning and perception by strategically selecting training samples based on quantifiable difficulty metrics.
Contribution
It proposes novel difficulty metrics and a hierarchical training framework that jointly optimize perception and reasoning in multimodal models, enhancing performance without supervised fine-tuning.
Findings
Difficulty-stratified sampling improves model accuracy.
GRPO on hard samples outperforms traditional fine-tuning.
Hierarchical training enhances perception and reasoning capabilities.
Abstract
Recent advances in Multimodal Large Language Models (MLLMs) have spurred significant progress in Chain-of-Thought (CoT) reasoning. Building on the success of Deepseek-R1, researchers extended multimodal reasoning to post-training paradigms based on reinforcement learning (RL), focusing predominantly on mathematical datasets. However, existing post-training paradigms tend to neglect two critical aspects: (1) The lack of quantifiable difficulty metrics capable of strategically screening samples for post-training optimization. (2) Suboptimal post-training paradigms that fail to jointly optimize perception and reasoning capabilities. To address this gap, we propose two novel difficulty-aware sampling strategies: Progressive Image Semantic Masking (PISM) quantifies sample hardness through systematic image degradation, while Cross-Modality Attention Balance (CMAB) assesses cross-modal…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
Taxonomy
TopicsMultimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling
