Do Post-Training Algorithms Actually Differ? A Controlled Study Across Model Scales Uncovers Scale-Dependent Ranking Inversions
Xiaoyi Li

TL;DR
This large-scale, controlled study compares post-training algorithms across model sizes using a unified framework that implements 51 of them. It finds that algorithm rankings are highly scale-dependent and that model scale has the largest impact on performance.
Contribution
Introduces OXRL, a unified framework for fair comparison of post-training algorithms, and provides the first large-scale analysis showing scale-dependent ranking inversions and task-specific algorithm leverage.
Findings
Algorithm rankings invert with model scale.
Loss function modifications have negligible effects.
Algorithm leverage varies significantly across tasks.
Abstract
Post-training alignment has produced dozens of competing algorithms (DPO, SimPO, KTO, GRPO, and others), yet practitioners lack controlled comparisons to guide algorithm selection. We present OXRL, a unified framework implementing 51 post-training algorithms with identical infrastructure, enabling the first large-scale apples-to-apples evaluation. Our study spans 8 algorithms across 4 model scales (0.5B to 7B), 3 evaluation domains, and a 20-variant DPO taxonomy (100 runs at 1.5B, 5 seeds each), totaling 240 training runs on H100 GPUs. Three headline findings emerge. (1) Algorithm rankings are unstable across scale: at 1.5B, online RL (SGRPO) tops all methods at 58.0% ± 0.57 on GSM8K; by 7B, the worst small-scale method (SimPO) becomes the best (85.8%), a complete ranking inversion driven by model scale rather than LoRA regularization (confirmed via 2×2 factorial).…
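As a rough illustration of what a "ranking inversion" between two model scales means, the sketch below compares two score tables with a hand-rolled Kendall rank correlation. It is not the paper's analysis code: the function names, the placeholder method names (`method_a`, `method_b`, `method_c`), and the scores are all hypothetical and chosen only to make the arithmetic easy to follow.

```python
from itertools import combinations


def rank_methods(scores):
    """Return method names ordered from best to worst by score."""
    return sorted(scores, key=scores.get, reverse=True)


def kendall_tau(scores_small, scores_large):
    """Kendall rank correlation between two score tables over the same methods.

    +1.0 means identical rankings, -1.0 means a complete inversion.
    Tied pairs count as neither concordant nor discordant.
    """
    methods = list(scores_small)
    concordant = discordant = 0
    for a, b in combinations(methods, 2):
        # Sign of the product tells us whether the pair keeps or flips its order.
        product = (scores_small[a] - scores_small[b]) * (scores_large[a] - scores_large[b])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(methods) * (len(methods) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical accuracies, NOT results from the paper.
small_scale = {"method_a": 0.6, "method_b": 0.5, "method_c": 0.4}
large_scale = {"method_a": 0.7, "method_b": 0.8, "method_c": 0.9}

tau = kendall_tau(small_scale, large_scale)
fully_inverted = rank_methods(small_scale) == rank_methods(large_scale)[::-1]
```

With these toy inputs every pair of methods swaps order, so `tau` is -1.0 and `fully_inverted` is true. Values near zero would instead indicate that rankings at one scale say little about rankings at the other.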
Taxonomy
Topics: Machine Learning and Data Classification · Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques
