Model Parallelism With Subnetwork Data Parallelism
Vaibhav Singh, Zafir Khalid, Edouard Oyallon, Eugene Belilovsky

TL;DR
This paper presents Subnetwork Data Parallelism (SDP), a novel distributed training method that partitions models into subnetworks to reduce memory and communication costs without sacrificing accuracy.
Contribution
The paper introduces SDP with two masking regimes (backward and forward) and two subnetwork construction strategies (neuron level and block level), and evaluates it on CNNs, transformers, and LLM pre-training.
Findings
SDP reduces per-device memory usage by 30%-75% in experiments on CNNs, transformers, and LLM pre-training.
Forward masking can outperform traditional methods at similar FLOPs.
SDP maintains or improves performance despite reduced resource consumption.
Abstract
Pre-training large neural networks at scale imposes heavy memory demands on accelerators and often requires costly communication. We introduce Subnetwork Data Parallelism (SDP), a distributed training framework that partitions a model into structured subnetworks trained across workers without exchanging activations. We study two complementary masking regimes: backward masking, which applies sparsity only in the backward step to retain unbiased gradients, and forward masking, which also removes parameters in the forward pass to deliver stronger efficiency gains while providing additional regularization. We further explore two subnetwork construction strategies: neuron level and block level, applied across both CNNs and transformers. In experiments spanning CNNs and transformers on CIFAR and ImageNet, as well as LLM pre-training on FineWeb, SDP reduces per-device memory usage by 30%-75%…
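The difference between the two masking regimes described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a single linear model with squared loss, two workers that see the same sample, hand-picked overlapping 0/1 masks, and aggregation that averages each coordinate over the workers whose mask covers it. All function names and values are hypothetical.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def backward_masked_grad(w, x, t, mask):
    """Full forward pass; only the gradient entries the worker holds are kept."""
    err = dot(w, x) - t
    return [m * err * xi for m, xi in zip(mask, x)]

def forward_masked_grad(w, x, t, mask):
    """Masked parameters are also removed from the forward pass."""
    w_sub = [m * wi for m, wi in zip(mask, w)]
    err = dot(w_sub, x) - t
    return [m * err * xi for m, xi in zip(mask, x)]

def aggregate(grads, masks):
    """Average each coordinate over the workers whose mask covers it (assumed rule)."""
    out = []
    for j in range(len(grads[0])):
        owners = sum(m[j] for m in masks)
        total = sum(g[j] for g in grads)
        out.append(total / owners if owners else 0.0)
    return out

# Hypothetical toy values: 4 parameters, 2 workers with overlapping masks.
w = [0.5, -1.0, 2.0, 0.25]
x = [1.0, 2.0, -1.0, 4.0]
t = 1.0
masks = [[1, 1, 1, 0], [0, 1, 1, 1]]

full_grad = [(dot(w, x) - t) * xi for xi in x]
bwd = aggregate([backward_masked_grad(w, x, t, m) for m in masks], masks)
fwd = aggregate([forward_masked_grad(w, x, t, m) for m in masks], masks)

print("full    :", full_grad)
print("backward:", bwd)
print("forward :", fwd)
```

In this toy setting the aggregated backward-masked gradient coincides with the full gradient, because every parameter is covered by at least one worker and the forward pass is unchanged. The forward-masked gradient differs, since each worker computes its error from a reduced model. Only the masked gradient entries need to be exchanged, never activations.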
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The paper is tackling an important practical topic (improving the memory efficiency of training). 2. The paper is clear and very well-written. 3. There are theoretical derivations showing convergence for the backward-masking case (and the paper notes they do not have this guarantee for forward-masking). 4. The experimental evaluation covers multiple modalities (image classification, LLMs), models (ResNets, Swin, and Llama), and datasets (CIFAR-10/100, ImageNet, FineWeb). 5. The incorporation
Systems: Overall, the paper's claims for systems impact (e.g., reduced memory) are qualitative, not quantitative. I would like to see additional empirical measurements and baselines to supplement the discussion. 1. The paper would benefit from a detailed runtime performance analysis. Experimentally (e.g., using PyTorch's memory analysis tools), how much memory is used in each run? What is the runtime (e.g., time per mini-batch and a plot of training curve versus time)? What is the throughput (fl
1. **Original formulation of distributed training**. The paper proposes subnetwork data parallelism (SDP), a novel formulation that bridges data and model parallelism by training structured subnetworks independently and synchronizing only overlapping parameters. This framing is original and relevant for memory-limited distributed training. 2. **Comprehensive experimental coverage**. The method is evaluated across diverse architectures (ResNet, SwinT, and LLaMA) showing consistent performance tre
1. **Lack of empirical evidence on actual efficiency gains**. While the paper claims substantial memory and communication reduction, it does not provide quantitative measurements of actual GPU memory usage or training wall-clock time. This makes it difficult to verify the claimed advantages. Even if activation memory dominates at a per-GPU batch size of 64, the authors could still report memory statistics or time comparisons (forward, backward, and synchronization) against DDP for fairness. With
1. The proposed method seems general across model architectures. 2. The paper provides convergence guarantees for backward masking. 3. Experiments show significant improvement in memory usage.
1. Theoretical analyses of communication cost and precision loss are missing. 2. Too few baseline methods are compared against the proposed method. 3. Extensive experiments across different architectures and model sizes, as well as an efficiency analysis, are missing, which makes the paper unconvincing.
1. SDP proposes a new perspective: training overlapping subnetworks that maintain full paths from input to loss, avoiding activation exchange. This is an elegant idea that could inspire new hybrid-parallel designs. 2. The authors test across architectures (ResNet, Swin-T, LLaMA-style transformers) and datasets (CIFAR-10/100, ImageNet, FineWeb). The breadth of experiments strengthens the paper’s credibility.
1. The forward masking variant, which empirically performs best, lacks theoretical backing. This weakens the conceptual balance between theory and practice. 2. As an experiment-focused paper, it lacks comparisons with important baselines. For example, results for directly applying Dropout [1] and stochastic depth [2], which inspired the two variants of SDP, are not included. 3. In the experiments, the training epochs are extended inversely with C to have a FLOP-matc
Taxonomy
Topics: Distributed and Parallel Computing Systems · Advanced Database Systems and Queries · Scientific Computing and Data Management
