Revisiting the Data Sampling in Multimodal Post-training from a Difficulty-Distinguish View

Jianyu Qi; Ding Zou; Wenrui Yan; Rui Ma; Jiaxu Li; Zhijie Zheng; Zhiguo Yang; Rongchang Zhao

arXiv:2511.06722·cs.CV·November 11, 2025

TL;DR

This paper introduces difficulty-aware sampling strategies for post-training multimodal large language models. Training samples are selected according to quantifiable difficulty metrics, with the aim of improving both perception and reasoning.

Contribution

It proposes novel difficulty metrics and a hierarchical training framework that jointly optimize perception and reasoning in multimodal models, enhancing performance without supervised fine-tuning.

Findings

01 Difficulty-stratified sampling improves model accuracy.

02 GRPO on hard samples outperforms traditional fine-tuning.

03 Hierarchical training enhances perception and reasoning capabilities.

Abstract

Recent advances in Multimodal Large Language Models (MLLMs) have spurred significant progress in Chain-of-Thought (CoT) reasoning. Building on the success of DeepSeek-R1, researchers extended multimodal reasoning to post-training paradigms based on reinforcement learning (RL), focusing predominantly on mathematical datasets. However, existing post-training paradigms tend to neglect two critical aspects: (1) the lack of quantifiable difficulty metrics capable of strategically screening samples for post-training optimization, and (2) suboptimal post-training paradigms that fail to jointly optimize perception and reasoning capabilities. To address these gaps, we propose two novel difficulty-aware sampling strategies: Progressive Image Semantic Masking (PISM) quantifies sample hardness through systematic image degradation, while Cross-Modality Attention Balance (CMAB) assesses cross-modal…
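The abstract describes PISM only as quantifying sample hardness "through systematic image degradation", so the following is a minimal sketch of that general idea, not the authors' procedure. It assumes a toy grayscale image stored as nested lists, random pixel masking at a fixed set of ratios, and a score defined as the first mask ratio at which a predictor's answer stops matching the reference. The names `mask_image`, `masking_difficulty`, `toy_predict`, the ratio schedule, and the score definition are all hypothetical.

```python
import random
from typing import Callable, List, Sequence

Image = List[List[float]]  # toy grayscale image: rows of pixel values in [0, 1]


def mask_image(image: Image, ratio: float, rng: random.Random) -> Image:
    """Return a copy of `image` with about `ratio` of its pixels set to 0.0."""
    height, width = len(image), len(image[0])
    coords = [(r, c) for r in range(height) for c in range(width)]
    n_mask = round(ratio * len(coords))
    masked = [row[:] for row in image]  # copy rows so the input is untouched
    for r, c in rng.sample(coords, n_mask):
        masked[r][c] = 0.0
    return masked


def masking_difficulty(
    image: Image,
    answer: str,
    predict: Callable[[Image], str],
    ratios: Sequence[float] = (0.0, 0.2, 0.4, 0.6, 0.8),
    seed: int = 0,
) -> float:
    """Return the first mask ratio at which `predict` no longer gives `answer`.

    Assumption for this sketch: a smaller value means the prediction breaks
    under less degradation. Returns 1.0 if the prediction never fails.
    """
    rng = random.Random(seed)
    for ratio in ratios:
        if predict(mask_image(image, ratio, rng)) != answer:
            return ratio
    return 1.0


def toy_predict(image: Image) -> str:
    """Hypothetical stand-in for a model: labels an image by mean brightness."""
    pixels = [p for row in image for p in row]
    return "bright" if sum(pixels) / len(pixels) > 0.5 else "dark"


# A 4x4 all-white image stays "bright" until more than half its pixels are masked.
white = [[1.0] * 4 for _ in range(4)]
score = masking_difficulty(white, "bright", toy_predict)  # 0.6
```

In a real pipeline the stand-in predictor would be replaced by a call to the multimodal model being post-trained, and the resulting scores could be used to bucket samples by difficulty before sampling. How the paper defines its masking schedule and score is not stated in the text shown here.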

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling