The Cost of Avoiding Backpropagation
Kunjal Panchal, Sunav Choudhary, Yuriy Brun, Hui Guan

TL;DR
This paper compares backpropagation, forward-mode automatic differentiation, and zero-order optimization, revealing that BP with checkpointing remains superior in accuracy, speed, and efficiency for training large models under memory constraints.
Contribution
It provides the first comprehensive theoretical and empirical comparison of these methods, highlighting the limitations of FmAD and ZO and reaffirming BP with checkpointing as the best approach.
Findings
BP with checkpointing outperforms FmAD and ZO in accuracy and speed
FmAD and ZO incur higher costs in accuracy and computation
Results favor BP with checkpointing for memory-constrained training
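To make "BP with checkpointing" concrete, here is a minimal sketch of the memory-for-compute trade it relies on. It assumes a toy chain of scalar tanh layers, not the paper's models or code; the names `backprop_full`, `backprop_checkpointed`, and `every` are hypothetical. Full BP keeps every layer input, while the checkpointed version keeps every `every`-th input and recomputes the rest one segment at a time during the backward pass.

```python
import math


def layer(a, h):
    # One toy layer: scalar weight a, scalar input h.
    return math.tanh(a * h)


def layer_grad(a, h):
    # Derivative of layer(a, h) with respect to its input h.
    t = math.tanh(a * h)
    return a * (1.0 - t * t)


def backprop_full(weights, x):
    # Standard BP: store the input of every layer during the forward pass.
    inputs = []
    h = x
    for a in weights:
        inputs.append(h)
        h = layer(a, h)
    grad = 1.0
    for a, h_in in zip(reversed(weights), reversed(inputs)):
        grad *= layer_grad(a, h_in)
    return grad, len(inputs)  # gradient d(output)/dx, stored activations


def backprop_checkpointed(weights, x, every):
    # Forward pass: keep only every `every`-th layer input (assumed policy).
    n = len(weights)
    checkpoints = {}
    h = x
    for i, a in enumerate(weights):
        if i % every == 0:
            checkpoints[i] = h
        h = layer(a, h)

    grad = 1.0
    peak = len(checkpoints)
    recomputed = 0
    # Backward pass: process segments from last to first.
    for start in sorted(checkpoints, reverse=True):
        end = min(start + every, n)
        seg = [checkpoints[start]]
        # Recompute the missing layer inputs inside this segment.
        for a in weights[start:end - 1]:
            seg.append(layer(a, seg[-1]))
            recomputed += 1
        peak = max(peak, len(checkpoints) + len(seg) - 1)
        for a, h_in in zip(reversed(weights[start:end]), reversed(seg)):
            grad *= layer_grad(a, h_in)
    return grad, peak, recomputed


if __name__ == "__main__":
    ws = [0.5 + 0.05 * i for i in range(16)]
    print(backprop_full(ws, 0.3))
    print(backprop_checkpointed(ws, 0.3, every=4))
```

In this sketch both functions return the same gradient; the checkpointed one holds fewer activations at its peak and pays for that with extra forward evaluations.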
Abstract
Forward-mode automatic differentiation (FmAD) and zero-order (ZO) optimization have been proposed as memory-efficient alternatives to backpropagation (BP) for gradient computation, especially in low-resource settings. However, their practical benefits remain unclear due to two key gaps: a lack of comparison against memory-efficient BP variants, such as activation checkpointing, and a lack of a unified theoretical analysis. This work presents a comprehensive theoretical and empirical comparison of BP, FmAD, and ZO methods. Our theoretical analysis shows that while FmAD and ZO can reduce memory usage, they incur significant costs in accuracy, convergence speed, and computation compared to BP with checkpointing. These drawbacks worsen with larger models or constrained perturbation budgets. Empirical experiments on large language and vision-language models show that BP with checkpointing…
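The three gradient routes named in the abstract can be contrasted on a toy problem. The sketch below is an illustration under stated assumptions, not the paper's implementation: the quadratic `loss`, the hand-written `jvp` (standing in for what a forward-mode AD pass would compute), the Gaussian perturbations, and the step size `eps` are all hypothetical choices. BP yields the exact gradient, whereas the FmAD-style and ZO-style estimators project the gradient onto random directions, so their error depends on how many perturbations are averaged.

```python
import math
import random


def loss(w):
    # Toy quadratic loss: 0.5 * sum_i (i + 1) * w_i^2 (assumed for illustration).
    return 0.5 * sum((i + 1) * wi * wi for i, wi in enumerate(w))


def exact_grad(w):
    # The gradient that backpropagation would return for this loss.
    return [(i + 1) * wi for i, wi in enumerate(w)]


def jvp(w, v):
    # Directional derivative grad(w) . v, written by hand here; forward-mode
    # AD obtains this quantity in a single forward pass.
    return sum((i + 1) * wi * vi for i, (wi, vi) in enumerate(zip(w, v)))


def forward_gradient(w, n_perturb, rng):
    # FmAD-style estimate: average of (grad . v) * v over random directions v.
    d = len(w)
    est = [0.0] * d
    for _ in range(n_perturb):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        s = jvp(w, v)
        for i in range(d):
            est[i] += s * v[i] / n_perturb
    return est


def zo_gradient(w, n_perturb, rng, eps=1e-3):
    # ZO-style estimate: same projection, but the directional derivative is
    # approximated with a central finite difference of loss values only.
    d = len(w)
    est = [0.0] * d
    for _ in range(n_perturb):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        w_plus = [wi + eps * vi for wi, vi in zip(w, v)]
        w_minus = [wi - eps * vi for wi, vi in zip(w, v)]
        s = (loss(w_plus) - loss(w_minus)) / (2.0 * eps)
        for i in range(d):
            est[i] += s * v[i] / n_perturb
    return est


def rel_error(est, ref):
    # Relative Euclidean error of an estimate against the exact gradient.
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(est, ref)))
    return num / math.sqrt(sum(b * b for b in ref))


if __name__ == "__main__":
    w = [1.0] * 10
    g = exact_grad(w)
    for budget in (1, 10, 100, 1000):
        err = rel_error(forward_gradient(w, budget, random.Random(0)), g)
        print(budget, err)
```

Running the loop at the bottom prints the relative error for several perturbation budgets, which gives a small-scale feel for why a constrained budget hurts these estimators. On this quadratic the central difference is exact up to rounding, so the ZO and FmAD estimates coincide when given the same random directions; on a general loss the ZO estimate would carry additional finite-difference error.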
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
