SPARD: Self-Paced Curriculum for RL Alignment via Integrating Reward Dynamics and Data Utility
Xuyang Zhi, Peilun Zhou, Chengqiang Lu, Hang Lv, Yiwei Liang, Rongyang Zhang, Yan Gao, Yi Wu, Yao Hu, Hongchao Gu, Defu Lian, Hao Wang, Enhong Chen

TL;DR
SPARD introduces a self-paced curriculum framework that dynamically adjusts reward weights and data importance based on learning progress, improving LLM training in complex, multi-objective scenarios.
Contribution
It presents an automated, self-paced curriculum method that accounts for non-stationary dynamics and data heterogeneity in RL alignment for large language models.
Findings
SPARD improves model capabilities across multiple benchmarks.
Dynamic reward weighting enhances learning efficiency.
The framework adapts to complex, multi-objective reward systems.
Abstract
The evolution of Large Language Models (LLMs) is shifting the focus from single, verifiable tasks toward complex, open-ended real-world scenarios, imposing significant challenges on the post-training phase. In these settings, the scale and complexity of reward systems have grown significantly, transitioning toward multi-objective formulations that encompass a comprehensive spectrum of model capabilities and application contexts. However, traditional methods typically rely on fixed reward weights, ignoring non-stationary learning dynamics and struggling with data heterogeneity across dimensions. To address these issues, we propose SPARD, a framework that establishes an automated, self-paced curriculum by perceiving learning progress to dynamically adjust multi-objective reward weights and data importance, thereby synchronizing learning intent with data utility for optimal performance.…
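The abstract as shown here does not give the actual update rule, so the following is only a minimal illustrative sketch of what "perceiving learning progress to dynamically adjust multi-objective reward weights" could look like. Everything in it is an assumption made for illustration and not SPARD's method: the function names `update_weights` and `scalarize`, the `temperature` parameter, the objective names, and the choice of a softmax over the change in mean reward per objective. It covers only reward weighting and leaves out data importance.

```python
import math
from typing import Dict


def update_weights(
    prev_mean: Dict[str, float],
    curr_mean: Dict[str, float],
    temperature: float = 0.1,
) -> Dict[str, float]:
    """Hypothetical rule: give more weight to objectives that improved most.

    prev_mean / curr_mean hold the mean reward per objective over two
    consecutive training windows. This is an illustrative heuristic,
    not the update rule from the paper.
    """
    # Learning progress: change in mean reward between the two windows.
    progress = {k: curr_mean[k] - prev_mean[k] for k in curr_mean}

    # Numerically stable softmax over progress values.
    top = max(progress.values())
    exps = {k: math.exp((v - top) / temperature) for k, v in progress.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


def scalarize(rewards: Dict[str, float], weights: Dict[str, float]) -> float:
    """Combine per-objective rewards into one scalar using the given weights."""
    return sum(weights[k] * rewards[k] for k in weights)


# Example with made-up objectives and made-up reward values.
prev = {"helpfulness": 0.50, "safety": 0.80, "format": 0.90}
curr = {"helpfulness": 0.60, "safety": 0.82, "format": 0.90}

weights = update_weights(prev, curr)
reward = scalarize({"helpfulness": 0.7, "safety": 0.9, "format": 1.0}, weights)
print(weights, reward)
```

A fixed-weight baseline corresponds to never calling `update_weights` and reusing the same `weights` throughout training. A lower `temperature` concentrates weight on the fastest-improving objective, while a higher one moves the weights toward uniform.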
