Performance Characterization of Distributed Deep Learning Strategies: A Quantitative Evaluation of DDP, FSDP, and Parameter Server Architectures on GPU Clusters

Md Sultanul Islam Ovi

arXiv:2505.12832 · cs.DC · January 6, 2026

Open Access

TL;DR

This paper empirically compares distributed deep learning strategies—DDP, FSDP, and Parameter Server—on GPU clusters, analyzing their performance, memory efficiency, and accuracy impacts to guide system design choices.

Contribution

It provides a side-by-side evaluation of three dominant distributed training paradigms (DDP, FSDP, and Parameter Server) on both high-performance (NVIDIA A100) and commodity-class (NVIDIA A10G) clusters, highlighting their trade-offs and optimal use cases.

Findings

01

FSDP reduces peak memory usage by 4-6x, aiding memory-constrained training.

02

Asynchronous Parameter Server speeds up training by 28% but causes 4-17% accuracy loss.

03

DDP offers 2-3x throughput speedup on high-performance clusters.

Abstract

Efficiently scaling deep neural networks across GPU clusters requires navigating complex trade-offs between computational throughput, memory utilization, and synchronization overhead. This paper presents a unified empirical evaluation of three dominant distributed training paradigms: Distributed Data Parallel (DDP), Fully Sharded Data Parallel (FSDP), and the Parameter Server (PS) architecture. We conduct side-by-side benchmarking on both high-performance (NVIDIA A100) and commodity-class (NVIDIA A10G) clusters to isolate the impact of communication bandwidth and gang-scheduling dependencies. Our results indicate that while DDP achieves a 2-3x speedup in training throughput for standard architectures, FSDP demonstrates a 4-6x reduction in peak memory usage, validating its utility for memory-constrained environments despite higher communication latency. Furthermore, we evaluate the…
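The memory trade-off between DDP and FSDP can be illustrated with a back-of-envelope calculation. The sketch below is not taken from the paper: the function name and the 16 bytes-per-parameter figure (a common estimate for mixed-precision training with Adam) are assumptions made here for illustration. It counts only model state (parameters, gradients, optimizer state) and ignores activations and temporary buffers.

```python
def model_state_bytes_per_gpu(num_params, num_gpus, sharded, bytes_per_param=16):
    """Rough per-GPU memory for parameters, gradients, and optimizer state.

    Hypothetical helper for illustration only.
    bytes_per_param=16 is an assumed figure for mixed-precision Adam;
    activations and communication buffers are NOT included.
    """
    if num_params < 0 or num_gpus < 1:
        raise ValueError("num_params must be >= 0 and num_gpus must be >= 1")
    total = num_params * bytes_per_param
    # DDP-style replication: every GPU holds the full model state.
    # FSDP-style full sharding: the state is split evenly across GPUs.
    return total / num_gpus if sharded else total


if __name__ == "__main__":
    params = 1_000_000_000  # assumed 1B-parameter model
    gpus = 8                # assumed cluster size
    replicated = model_state_bytes_per_gpu(params, gpus, sharded=False)
    sharded = model_state_bytes_per_gpu(params, gpus, sharded=True)
    print(f"replicated: {replicated / 1e9:.1f} GB per GPU")
    print(f"sharded:    {sharded / 1e9:.1f} GB per GPU")
```

Because activations and buffers are left out of this estimate, measured peak-memory savings will generally differ from the idealized ratio it produces.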

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Cloud Computing and Resource Management · Advanced Neural Network Applications