Tuning the burn-in phase in training recurrent neural networks improves their performance

Julian D. Schiller; Malte Heinrich; Victor G. Lopez; Matthias A. Müller

arXiv:2602.10911·cs.LG·February 12, 2026

Open Access · 3 Reviews

TL;DR

This paper investigates how tuning the burn-in phase during truncated backpropagation through time can significantly enhance the training efficiency and accuracy of recurrent neural networks on time series tasks.

Contribution

It provides theoretical bounds on performance loss due to truncated training and highlights the importance of burn-in tuning for improved RNN training outcomes.

Findings

01

Proper burn-in tuning reduces prediction error by over 60%.

02

Theoretical bounds link burn-in length to training accuracy.

03

Experimental validation on benchmarks confirms the impact of burn-in tuning.

Abstract

Training recurrent neural networks (RNNs) with standard backpropagation through time (BPTT) can be challenging, especially in the presence of long input sequences. A practical alternative to reduce computational and memory overhead is to perform BPTT repeatedly over shorter segments of the training data set, corresponding to truncated BPTT. In this paper, we examine the training of RNNs when using such a truncated learning approach for time series tasks. Specifically, we establish theoretical bounds on the accuracy and performance loss when optimizing over subsequences instead of the full data sequence. This reveals that the burn-in phase of the RNN is an important tuning knob in its training, with significant impact on the performance guarantees. We validate our theoretical results through experiments on standard benchmarks from the fields of system identification and time series…
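The role of the burn-in phase described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the scalar linear RNN, the non-overlapping windows, the zero initial hidden state, and the names `simulate` and `burn_in_loss`. Each window restarts from a zero state, and the first `burn_in` steps warm up the state without contributing to the loss.

```python
import random

def simulate(a, b, c, u, h0=0.0):
    """Hypothetical scalar linear RNN: h[t+1] = a*h[t] + b*u[t], y[t] = c*h[t]."""
    h, ys = h0, []
    for u_t in u:
        ys.append(c * h)
        h = a * h + b * u_t
    return ys

def burn_in_loss(params, u, y, window, burn_in):
    """Mean squared error over non-overlapping windows of the sequence.

    Each window restarts from a zero hidden state (an assumption here);
    the first `burn_in` steps of a window are excluded from the loss.
    Requires burn_in < window <= len(u).
    """
    a, b, c = params
    total, count = 0.0, 0
    for start in range(0, len(u) - window + 1, window):
        y_hat = simulate(a, b, c, u[start:start + window], h0=0.0)
        for k in range(burn_in, window):
            total += (y_hat[k] - y[start + k]) ** 2
            count += 1
    return total / count

rng = random.Random(0)
true_params = (0.9, 1.0, 1.0)  # |a| < 1, so the effect of the initial state decays
u = [rng.uniform(-1.0, 1.0) for _ in range(400)]
y = simulate(*true_params, u)  # noise-free data generated over the full sequence

# Evaluate the true parameters on truncated windows with different burn-in lengths.
losses = {m: burn_in_loss(true_params, u, y, window=40, burn_in=m) for m in (0, 10, 30)}
```

In this toy setting, even the true parameters incur a nonzero loss on truncated windows, because the zero initial state is wrong at the start of every window after the first. Since the state mismatch decays geometrically with `a`, a longer burn-in removes more of that error from the loss, at the cost of fewer steps contributing to it.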

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

The proposed method is simple yet effective. The authors provide theoretical proofs for their proposed method. Theorem 1.1: An error bound is developed between models with different initial hidden states, showing that the difference converges if the model has decaying memory (small lambda). Theorem 1.2: A bound on the training regret is developed. This is the difference between the ideal-case approximation error, where an ideal initial state is known, and the approximation error when using subsequences with burn i

Weaknesses

Some of the concepts and terms are not sufficiently motivated, and the intuition and explanation for the theorems are not very clear, which can make the results hard for non-experts to follow. Notation is not recalled in the theorem statements, so the reader must frequently refer back to find it. Certain parts of the experiments can be improved. For details, please see the questions.

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- This work analyzes TBPTT with zero-initialized hidden states through the lens of regret w.r.t. a training benchmark and shows that the burn-in period (m) affects the performance guarantees, both training regret and performance regret.
- Experimental evaluations on the synthetic data serve as a good validation of the proposed bounds. They show that larger values of the burn-in phase (m) allow for longer transient phases and enable the RNN to exploit the exploration to reach closer to the benchmark out

Weaknesses

- Although the theoretical bounds are valid in the synthetic data setup, the theoretical analysis weakens as we move toward harder real-world setups.
- It is unclear how Assumption 1 holds in realistic settings with more complex choices of RNNs (such as LSTMs, GRUs, etc.).
- Although the burn-in period theoretically helps in simplifying and analyzing the regret, it is unclear whether the burn-in period is really the bottleneck in more sophisticated TBPTT (such as non-zero hidden

Reviewer 03 · Rating 4 · Confidence 2

Strengths

The paper combines theoretical and empirical research. It is well structured and generally clearly written.

Weaknesses

The theoretical and practical consequences of this work are unclear. The most detailed empirical results are provided for a synthetic task that is so simple it is unlikely to have any practical relevance.

Taxonomy

Topics: Neural Networks and Reservoir Computing · Stock Market Forecasting Methods · Neural Networks and Applications