The Cost of Avoiding Backpropagation

Kunjal Panchal; Sunav Choudhary; Yuriy Brun; Hui Guan

arXiv:2506.21833·cs.LG·June 30, 2025

Open Access

TL;DR

This paper compares backpropagation (BP), forward-mode automatic differentiation (FmAD), and zero-order (ZO) optimization, and finds that BP with activation checkpointing remains more accurate, faster to converge, and cheaper in computation than both alternatives when training large models under memory constraints.

Contribution

The paper provides the first comprehensive theoretical and empirical comparison of BP, FmAD, and ZO, quantifies the limitations of FmAD and ZO, and reaffirms BP with checkpointing as the best approach for memory-constrained training.

Findings

01 BP with checkpointing outperforms FmAD and ZO in accuracy and convergence speed

02 FmAD and ZO lose accuracy and require more computation than BP with checkpointing

03 Results favor BP with checkpointing for memory-constrained training

Abstract

Forward-mode automatic differentiation (FmAD) and zero-order (ZO) optimization have been proposed as memory-efficient alternatives to backpropagation (BP) for gradient computation, especially in low-resource settings. However, their practical benefits remain unclear due to two key gaps: a lack of comparison against memory-efficient BP variants, such as activation checkpointing, and a lack of a unified theoretical analysis. This work presents a comprehensive theoretical and empirical comparison of BP, FmAD, and ZO methods. Our theoretical analysis shows that while FmAD and ZO can reduce memory usage, they incur significant costs in accuracy, convergence speed, and computation compared to BP with checkpointing. These drawbacks worsen with larger models or constrained perturbation budgets. Empirical experiments on large language and vision-language models show that BP with checkpointing…
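The three gradient strategies named in the abstract can be contrasted on a toy problem. The following is a minimal sketch, not the paper's implementation: the quadratic `loss`, the helper names, the Gaussian perturbation directions, and the step size `eps` are all assumptions made for illustration. The analytic gradient stands in for what BP would return, the forward-gradient estimate uses the directional derivative that a single FmAD pass would provide (computed here from the analytic gradient rather than by a real forward-mode pass), and the ZO estimate uses a two-point finite difference of loss values.

```python
import random


def loss(w):
    """Toy quadratic loss (an assumption): 0.5 * sum_i (i + 1) * w_i^2."""
    return 0.5 * sum((i + 1) * x * x for i, x in enumerate(w))


def exact_grad(w):
    """Analytic gradient of `loss`; stands in for what BP would compute."""
    return [(i + 1) * x for i, x in enumerate(w)]


def sample_direction(dim, rng):
    """Draw one random perturbation direction v ~ N(0, I)."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def forward_grad_estimate(w, v):
    # FmAD yields the directional derivative (grad . v) in one forward pass.
    # Here it is computed from the analytic gradient purely for illustration.
    directional = sum(g * vi for g, vi in zip(exact_grad(w), v))
    return [directional * vi for vi in v]


def zo_grad_estimate(w, v, eps=1e-3):
    # ZO approximates the same directional derivative from two loss values.
    w_plus = [x + eps * vi for x, vi in zip(w, v)]
    w_minus = [x - eps * vi for x, vi in zip(w, v)]
    slope = (loss(w_plus) - loss(w_minus)) / (2.0 * eps)
    return [slope * vi for vi in v]


def averaged_estimate(estimator, w, num_perturbations, seed=0):
    """Average the estimator over `num_perturbations` random directions."""
    rng = random.Random(seed)
    total = [0.0] * len(w)
    for _ in range(num_perturbations):
        v = sample_direction(len(w), rng)
        total = [t + e for t, e in zip(total, estimator(w, v))]
    return [t / num_perturbations for t in total]


def relative_error(estimate, reference):
    """Euclidean distance to the reference, relative to the reference norm."""
    num = sum((a - b) ** 2 for a, b in zip(estimate, reference)) ** 0.5
    den = sum(b * b for b in reference) ** 0.5
    return num / den


if __name__ == "__main__":
    w = [1.0, 1.0, 1.0, 1.0]
    g = exact_grad(w)
    for n in (1, 10, 100, 1000):
        fwd = averaged_estimate(forward_grad_estimate, w, n)
        zo = averaged_estimate(zo_grad_estimate, w, n)
        print(n, relative_error(fwd, g), relative_error(zo, g))
```

In this sketch, `num_perturbations` plays the role of a perturbation budget: each extra direction costs additional forward computation, and with few directions both estimates are noisy approximations of the exact gradient. This is only a toy illustration of the trade-off the abstract describes, not a reproduction of the paper's analysis or results.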

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning