MAPLE: Modality-Aware Post-training and Learning Ecosystem
Nikhil Verma, Minjung Kim, JooYoung Yoo, Kyung-Min Jin, Manasa Bharadwaj, Kevin Ferreira, Ko Keun Kim, Youngjoon Kim

TL;DR
MAPLE is a modality-aware post-training ecosystem for multimodal language models. It improves robustness, convergence speed, and accuracy by accounting for which modalities each task actually requires during batching, weighting, and curriculum scheduling.
Contribution
The paper presents MAPLE, which combines a new benchmark (MAPLE-bench), a modality-aware policy optimization framework (MAPO), and adaptive weighting with curriculum scheduling to address the limitations of existing modality-blind training methods.
Findings
Reduces policy-gradient variance and improves convergence speed.
Narrows uni/multi-modal accuracy gaps by over 30%.
Maintains stability across various modality combinations.
Abstract
Multimodal language models now integrate text, audio, and video for unified reasoning. Yet existing RL post-training pipelines treat all input signals as equally relevant, ignoring which modalities each task actually requires. This modality-blind training inflates policy-gradient variance, slows convergence, and degrades robustness to real-world distribution shifts where signals may be missing, added, or reweighted. We introduce MAPLE, a complete modality-aware post-training and learning ecosystem comprising: (1) MAPLE-bench, the first benchmark explicitly annotating minimal signal combinations required per task; (2) MAPO, a modality-aware policy optimization framework that stratifies batches by modality requirement to reduce gradient variance from heterogeneous group advantages; (3) adaptive weighting and curriculum scheduling that balance and prioritize harder signal combinations.…
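The stratification idea in item (2) can be illustrated with a minimal sketch. This is not the authors' MAPO implementation: it assumes a GRPO-style setup in which each sample's advantage is its reward normalized within a group, and it assumes a hypothetical sample format with `id`, `required` (the set of modalities the task needs), and `reward` fields. The function names are illustrative only. The point it shows is that normalizing within groups that share a modality requirement keeps easy and hard signal combinations from distorting each other's advantages.

```python
from collections import defaultdict
from statistics import mean, pstdev


def stratify_by_modality(samples):
    """Group samples by the set of modalities they require.

    The 'required' field is a hypothetical annotation, standing in for the
    per-task minimal signal combination described in the abstract.
    """
    strata = defaultdict(list)
    for sample in samples:
        strata[frozenset(sample["required"])].append(sample)
    return dict(strata)


def normalized_advantages(rewards, eps=1e-8):
    """Center and scale rewards within a single group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def stratified_advantages(samples):
    """Compute advantages separately inside each modality stratum."""
    advantages = {}
    for group in stratify_by_modality(samples).values():
        scaled = normalized_advantages([s["reward"] for s in group])
        for sample, adv in zip(group, scaled):
            advantages[sample["id"]] = adv
    return advantages


if __name__ == "__main__":
    batch = [
        {"id": 0, "required": ["audio"], "reward": 1.0},
        {"id": 1, "required": ["audio"], "reward": 0.0},
        {"id": 2, "required": ["audio", "video"], "reward": 0.2},
        {"id": 3, "required": ["video", "audio"], "reward": 0.4},
    ]
    for sample_id, adv in sorted(stratified_advantages(batch).items()):
        print(sample_id, round(adv, 3))
```

Using a `frozenset` as the stratum key makes the grouping independent of the order in which modalities are listed, so `["audio", "video"]` and `["video", "audio"]` fall into the same stratum.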
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Speech and dialogue systems
