Adaptively Truncating Backpropagation Through Time to Control Gradient Bias
Christopher Aicher, Nicholas J. Foti, Emily B. Fox

TL;DR
This paper introduces an adaptive truncation method for backpropagation through time in RNNs that picks the truncation length needed to keep gradient bias within a chosen tolerance, rather than fixing the length in advance.
Contribution
The paper proposes an adaptive TBPTT scheme that adjusts the truncation length to keep gradient bias below a chosen tolerance, supported by a convergence analysis for SGD with biased gradients and practical estimation methods.
Findings
Adaptive TBPTT reduces computational costs compared to fixed truncation.
The method improves convergence rates in training RNNs.
Experimental results show better performance on language modeling tasks.
Abstract
Truncated backpropagation through time (TBPTT) is a popular method for learning in recurrent neural networks (RNNs) that saves computation and memory at the cost of bias by truncating backpropagation after a fixed number of lags. In practice, choosing the optimal truncation length is difficult: TBPTT will not converge if the truncation length is too small, or will converge slowly if it is too large. We propose an adaptive TBPTT scheme that converts the problem from choosing a temporal lag to one of choosing a tolerable amount of gradient bias. For many realistic RNNs, the TBPTT gradients decay geometrically in expectation for large lags; under this condition, we can control the bias by varying the truncation length adaptively. For RNNs with smooth activation functions, we prove that this bias controls the convergence rate of SGD with biased gradients for our non-convex loss. Using this…
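As a rough illustration of the idea in the abstract, the sketch below picks a truncation length from a bias tolerance under the geometric-decay assumption. It is not the paper's estimator: the function names, the log-linear fit for the decay rate, and the bias proxy (geometric tail bound divided by the sum of per-lag gradient norms) are assumptions made for this example only.

```python
import math


def estimate_decay_rate(lag_norms, start):
    """Fit log(norm) against lag for lags >= start and return beta = exp(slope).

    Hypothetical helper: assumes per-lag gradient norms are positive and
    decay roughly geometrically for lags >= start.
    """
    xs = list(range(start, len(lag_norms)))
    if len(xs) < 2:
        raise ValueError("need at least two lags to fit a decay rate")
    ys = [math.log(lag_norms[k]) for k in xs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Ordinary least-squares slope of log-norm versus lag.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope /= sum((x - mean_x) ** 2 for x in xs)
    return math.exp(slope)


def smallest_truncation(lag_norms, beta, delta):
    """Return the smallest lag K whose bias proxy is at most delta.

    Bias proxy (an assumption of this sketch): the geometric tail bound
    sum_{j>=1} norm_K * beta**j = norm_K * beta / (1 - beta), divided by
    the sum of the measured norms for lags 0..K.
    Returns None if no measured lag meets the tolerance.
    """
    if not 0.0 < beta < 1.0:
        raise ValueError("geometric decay requires 0 < beta < 1")
    head = 0.0
    for K, norm in enumerate(lag_norms):
        head += norm
        tail_bound = norm * beta / (1.0 - beta)
        if tail_bound <= delta * head:
            return K
    return None


# Synthetic per-lag norms that halve at every lag (illustrative data only).
norms = [0.5 ** k for k in range(20)]
beta_hat = estimate_decay_rate(norms, start=5)
K = smallest_truncation(norms, beta_hat, delta=0.1)
```

Tightening `delta` yields a longer truncation and loosening it yields a shorter one, which is the trade-off the abstract describes: the user chooses a tolerable bias and the lag follows from it.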
Taxonomy
Topics: Sparse and Compressive Sensing Techniques · Blind Source Separation Techniques · Advanced Adaptive Filtering Techniques
Methods: Stochastic Gradient Descent
