Revisiting Parameter Server in LLM Post-Training
Xinyi Wan, Penghui Qi, Guangxing Huang, Chaoyi Ruan, Min Lin, Jialin Li

TL;DR
This paper introduces On-Demand Communication (ODC), which brings parameter-server-style robustness to workload imbalance into Fully Sharded Data Parallel (FSDP) training for large language model post-training. It replaces collective communication with direct point-to-point communication, yielding up to 36% speedup over FSDP.
Contribution
The paper proposes ODC, a new communication method that adapts Fully Sharded Data Parallelism for imbalanced workloads in LLM post-training, improving device utilization and training speed.
Findings
Up to 36% speedup over FSDP.
Reduces synchronization barriers from once per layer to once per minibatch.
Decouples device workloads for better load balancing.
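The effect of moving the barrier from every layer to every minibatch can be illustrated with a toy timing model. This is only a sketch under stated assumptions, not the paper's implementation: communication cost is ignored, per-layer synchronization is approximated by making every microbatch step wait for the slowest device, and the function names and compute times are hypothetical.

```python
from typing import List


def per_layer_barrier_time(times: List[List[float]]) -> float:
    """Frequent collective-style sync: each step is gated by the slowest device."""
    # zip(*times) groups the m-th microbatch of every device into one step.
    return sum(max(step) for step in zip(*times))


def per_minibatch_barrier_time(times: List[List[float]]) -> float:
    """Devices run independently and meet once at the end of the minibatch."""
    return max(sum(device) for device in times)


# Hypothetical compute times: times[d][m] is device d's time on its m-th microbatch.
# Long sequences land on a different device at each step.
times = [
    [4.0, 1.0, 1.0],
    [1.0, 4.0, 1.0],
    [1.0, 1.0, 4.0],
]

collective = per_layer_barrier_time(times)
on_demand = per_minibatch_barrier_time(times)
print(f"sync every step: {collective}, sync once per minibatch: {on_demand}")
```

In this made-up example every device has the same total work, yet the frequently synchronized schedule takes twice as long because a different device is the straggler at each step. The numbers are illustrative only and say nothing about the speedups measured in the paper.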
Abstract
Modern data parallel (DP) training favors collective communication over parameter servers (PS) for its simplicity and efficiency under balanced workloads. However, the balanced workload assumption no longer holds in large language model (LLM) post-training due to the high variance in sequence lengths. Under imbalanced workloads, collective communication creates synchronization barriers, leading to under-utilization of devices with smaller workloads. This change in training dynamics calls for a revisit of the PS paradigm for its robustness to such imbalance. We propose On-Demand Communication (ODC), which adapts PS into Fully Sharded Data Parallel (FSDP) by replacing collective all-gather and reduce-scatter with direct point-to-point communication. Compared to FSDP, ODC reduces the synchronization barrier from once per layer to once per minibatch and decouples the workload on…
Peer Reviews
Decision·ICLR 2026 Poster
- Well-motivated and straightforward to implement.
- An excellent complement to FSDP that only requires changing communication operators; it is theoretically and experimentally equivalent (minor precision differences from batch size variations have minimal impact).
The practical applications of ODC may be quite limited, requiring specific scenarios with load imbalance. In SFT, the paper only tested LongAlign and SWE-Smith; in RL, updating the actor is not the main bottleneck, and currently, partial-rollout or fully asynchronous training are more commonly used to improve overall system throughput.
- Novelty: The proposed method adapts the parameter server concept into the FSDP framework by replacing collective all-gather/reduce-scatter operations with point-to-point communication, thereby reducing synchronization overhead.
- Practicality: The authors implemented ODC and provided extensive experimental results demonstrating its reliability.
- Readability: The paper is well-structured and easy to follow.
No obvious drawbacks.
1. Good idea to call back to the parameter server. Personally I like this call-back.
2. Clear presentation and experimental results.
1. With PS, one major issue is parameter inconsistency (i.e., parameter delay). For example, with two GPUs, GPU0 pulls the weights before some updates xyz are applied, then GPU1 pulls the same weights after those updates, so they train on different weights at the same iteration number. This parameter inconsistency will hurt model convergence. There is no discussion of model convergence in the whole paper.
2. Indeed, FSDP/ZeRO already incorporate async parameter pre-fetch, hiding weights upda
Taxonomy
Topics: Cloud Computing and Resource Management · Parallel Computing and Optimization Techniques · IoT and Edge/Fog Computing
