Global temperature anomaly prediction by using additive twin LSTM networks

Cemal Keles; Burhan Baran; Baris Baykant Alagoz

PMC · DOI:10.1038/s41598-026-37255-x·January 28, 2026


Open Access

TL;DR

This paper introduces an improved LSTM model for predicting global temperature anomalies, showing better long-term forecasts compared to existing methods.

Contribution

The novel Additive Twin LSTM (AT-LSTM) model is proposed to enhance long-term temperature anomaly forecasting.

Findings

01

The AT-LSTM model outperforms conventional LSTM variants in long-term global temperature anomaly forecasts.

02

The model predicts a 2042 global temperature anomaly of 1.415 °C with ± 0.073 °C error, aligning with climate organization expectations.

Abstract

Due to the complexity of climate systems, data-driven modeling based on observed time series data is essential for predicting future climatic trends. This study aims to improve the long-term global temperature anomaly forecast performance of Long Short-Term Memory (LSTM) based neural network models. Although several LSTM variants and hybrid architectures have been suggested for time series prediction problems, the long-term forecast performance of these models may not be satisfactory in practice. To address these problems, the authors first focus on evaluating the forecast performance of models and suggest performance and test assessment procedures. Second, the authors propose an Additive Twin LSTM (AT-LSTM) model that can improve the forecast performance for the global temperature anomaly. Our test on the Berkeley Global Temperature Anomaly dataset demonstrates that…


Figures (15)

Forecast performance over test set for the benchmark function $y_3$.

Training and test performance for the benchmark function $y_4$.

The temperature predictions of 10th LSTM model for training and test set.

Forecast performance over test set of temperature dataset.

Future forecasts of global temperature anomaly by AT-LSTM models and the average forecast line (red line) for next 20 years.

Fundamental elements of an LSTM unit and their mathematical relations.

Prediction state (upper figure) and forecasting state (bottom figure).

Performance evaluation regions on time-series dataset.

Training and test performances for the benchmark function $y_1$.

Funding

Inonu University Scientific Research Projects Coordination Unit

Keywords

Climate change · Global warming · Temperature anomaly · Forecasting · Climate sciences · Environmental sciences · Mathematics and computing


Taxonomy

Topics: Climate variability and models · Hydrological Forecasting Using AI · Meteorological Phenomena and Simulations

Full text

Introduction

Global warming is a global-scale climatic transition problem that leads to rising global temperatures, melting glaciers, sea-level rise, expanding desert zones, and abnormal climatic events such as intermittent hurricanes, floods and landslides^1^. Among these effects, the rise in mean global temperature has a significant impact on daily-life activities, from energy consumption to agricultural and livestock activities, and it places serious pressure on social, economic and environmental systems^2^. Therefore, forecasting global temperature anomalies is crucial for improving resource management policies, understanding long-term climate change trends, and taking measures to prevent or mitigate associated problems^3,4^. Time series prediction, a developing field of big data analysis, is a natural tool for this problem^5^. However, traditional methods for time series forecasting have not been successful in capturing complicated sequential relationships and long-term dependencies^6^.

Recurrent neural networks (RNNs) can inherently learn sequential relations in time series data. Long short-term memory (LSTM) models have become widely used because they manage both short-term and long-term dependencies in time series patterns better than conventional RNNs^3^. In other words, the learning process of LSTM models is assumed to be relatively insensitive to the sequence length, which is important when datasets contain long-term dependencies^7^. This property is significant for data-driven modeling practice because deciding the necessary sequence length of an RNN for real-world data is not an easy problem. To achieve this property, LSTMs use specially designed building blocks: a classical LSTM unit is composed of a cell state and three gates, known as the input gate, the output gate^8^, and the forget gate^9^. Cell states mainly function as selective memories, and the gates control the updating and forgetting of information. Following the success of LSTMs in practice, a gated type of RNN, known as the gated recurrent unit (GRU), was suggested to achieve similar capabilities with fewer parameters. The GRU has elements similar to those of the LSTM. Specifically, it has a two-gate system (update and reset gates) to handle long-term temporal dependencies in time-series prediction^7^. In practice, GRUs are simpler and faster than LSTMs because they lack a separate cell state^10^. LSTMs are preferable for data sequences that have long-term dependencies, whereas GRUs can be effective in learning relations that do not involve long-term dependencies. Recent studies have also shown that hybrid architectures combining LSTM, CNN, and GRU models can enhance predictive performance depending on the characteristics of the dataset^11^. Table 1 briefly summarizes research works in the literature that are related to temperature prediction. For example, changes in the global temperature anomaly were analyzed by using an LSTM, with the NOAA and NASA datasets used for training the LSTM models^12^.

To improve consistency in the collected sequential temperature anomaly data, we used the Berkeley Global Temperature Anomaly monthly data averaged over five years. Five-year averaging of the monthly data can significantly reduce uncertainty in global temperature anomaly observations. In our numerical experiments, we observed that the long-term global temperature forecasting performance of LSTM models and many hybrid forms may not be dependable. Even though they can be highly accurate in short-term prediction (predicting the next temperature value), recursive prediction for long-term temperature forecasting into the future, which is called future forecasting, may not be consistent enough. To address this limitation, we propose two twin-branch architectures: the Additive Twin LSTM (AT-LSTM) and the Multiplicative Twin LSTM (MT-LSTM). The proposed AT-LSTM differs fundamentally from conventional LSTM and GRU architectures by employing a twin-branch recurrent design that allows the model to learn partially uncoupled dominant dynamics within the sequence. This structural distinction enhances long-horizon forecasting stability, as confirmed by benchmark and climate-data evaluations, where AT-LSTM can improve performance relative to representative recurrent models in RMSE, MAE, and R^2^ metrics. Our results indicate that the AT-LSTM model improves future forecasting performance for global temperature anomaly data compared to conventional LSTM and hybrid LSTM architectures. The main advantage comes from using twin LSTMs with a common decoder network, where each LSTM can focus on learning the behavior of one uncoupled dominant climate dynamic.
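The RMSE, MAE, and R^2^ metrics named above can be sketched in a few lines. This is an illustrative sketch using the standard textbook definitions; the function names and the toy values are assumptions and are not taken from the paper's experiments.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean) ** 2 for a in y_true)
    return 1.0 - ss_res / ss_tot

# Toy anomaly values in degrees Celsius (made up for illustration only).
observed = [0.9, 1.0, 1.1, 1.2]
forecast = [1.0, 1.0, 1.0, 1.2]
print(rmse(observed, forecast), mae(observed, forecast), r2(observed, forecast))
```

Lower RMSE and MAE and an R^2^ closer to 1 indicate a closer fit; RMSE penalizes large individual errors more strongly than MAE.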

Table 1. A brief survey for temperature prediction models.

| Works | Methods | Predicted data | Performance/remarks |
|---|---|---|---|
| Diffenbaugh et al. (2023)^13^ | ANN, XAI | Global warming | 1.5 °C global warming threshold is expected to be reached in 2033–2035 |
| Guo et al. (2024)^14^ | ANN, RNN, LSTM, CNN, CNN-LSTM | Monthly climate parameter | CNN-LSTM model provides higher accuracy with MAE, RMSE, R² |
| Uluocak et al. (2024)^15^ | LSTM-CNN, GRU-CNN | Daily air temperature | LSTM-CNN and GRU-CNN models make higher accuracy predictions with MAE, RMSE, NSE, and R² |
| Guo et al. (2023)^16^ | ANN, GRU, LSTM, CNN, CNN-GRU, CNN-LSTM | Atmospheric temperature | They use R², RMSE, MAE as success scales. The best R² value is 0.9952 with ANN. Average atmospheric temperature in 2030 is predicted to be 17.23 °C |
| Hamdan et al. (2023)^12^ | A mathematical model with RNN and LSTM | Global temperature and greenhouse gas emission changing | LSTM model obtains high accuracy results with RMSE: 2.018 (NOAA) and 0.814 (NASA) |
| Li et al. (2023)^17^ | SARIMA-LSTM | Air temperature | The accuracy of the model increase from 10.0% to 27.7% |
| Hou et al. (2022)^18^ | CNN-LSTM | Hourly air temperature | R²: CNN-LSTM is 0.7258, LSTM is 0.5949, CNN is 0.5291 |
| Haque et al. (2021)^19^ | SRN, GRU, LSTM, CNN, CNN-LSTM, GRU-LSTM | Hourly air temperature | The lowest RMSE (1.691 °C) is obtained with GRU-LSTM |
| Zhao et al. (2021)^20^ | CNN-GRU-RPASM | Air temperature | CNN-GRU-RPASM shows the best performance compared to traditional methods |
| Zhang et al. (2020)^21^ | CRNN (model consisting of CNN and RNN components) | Air temperature | For the CRNN, MAE is calculated as 0.907 °C and RMSE is as 1.697 °C. |
| Gong et al. (2024)^22^ | CNN-LSTM | Daily average temperature | The predicted curve shows strong agreement with the actual test data |
| Li et al. (2024)^1^ | CNN-LSTM | Temperature | obtained MSE (3.26217) and RMSE (1.80615) values are higher than the traditional methods |
| Karabulut et al. (2022)^4^ | LSTM | Air temperature | R^2^ value is 0.9937 for LSTM, 0.8869 for SVM |

Additive twin LSTM (AT-LSTM) for time series data prediction

Basics of LSTM model

In principle, an LSTM employs information-flow control mechanisms (gates) to produce a long-term memory effect by using short-term memory elements. A basic LSTM unit consists of a cell and three gates, known as the input gate, the output gate^8^, and the forget gate^9^. Cell states mainly function as selective memories that can convey useful (relevant) information over longer time intervals by means of the updating and forgetting control of the gates. Here, the forget gate is designed to forget (suppress) irrelevant information, and the input gate updates the cell memory with relevant new information. Thus, the cell determines cell states ($C_t$) that maintain useful information over arbitrary time intervals, thereby leading to a long-term memory effect. The output gate determines how the memory is translated to the output at each step. It also contributes to determining the hidden state ($h_t$) by regulating the cell state. This selective memory property makes the LSTM more efficient and flexible in learning which parts of the sequence are important and should be preserved through the information flow. Figure 1 shows the fundamental elements of an LSTM unit and the mathematical relations between them. The information in the cell state flows through a sequence of LSTM units, and this unit array forms a long memory effect that is based on the learned, selective control of the short-memory elements in each LSTM unit.

Fig. 1. Fundamental elements of an LSTM unit and their mathematical relations.

The mathematical foundation of each element can be summarized as:

1) A forget gate determines how much information from the previous cell state should be discarded. It calculates the weighted sum of the previous hidden state ($h_{t-1}$) and the current input ($x_t$), then applies a sigmoid activation function ($\sigma_g$) to regulate the gate output between 0 and 1. A value of 0 means “completely forget”, and a value of 1 means “completely keep”.

$$f_t = \sigma_g(W_f x_t + U_f h_{t-1} + b_f) \tag{1}$$

where the weight coefficients in the forget gate are denoted by the vectors $W_f$ for the current input $x_t$ and $U_f$ for the hidden state $h_{t-1}$, and $b_f$ is the bias term. They are optimized during the training stage of the LSTM via the backpropagation algorithm. This allows the network to learn which irrelevant information patterns should be suppressed in the cell states.

2) An input gate determines the amount of new information that will be added to the cell state. It uses two networks:

The first one is a neural network layer with sigmoid activation that regulates the amount of new information ($\tilde{C}_t$).

$$i_t = \sigma_g(W_i x_t + U_i h_{t-1} + b_i) \tag{2}$$

where the vectors $W_i$ and $U_i$ stand for the weight coefficients for the current input $x_t$ and the previous hidden state $h_{t-1}$, respectively. The gate value $i_t$ lies between 0 and 1 and regulates $\tilde{C}_t$.

The second one is a neural network layer with a tanh activation ($\sigma_c$) that generates a vector of new candidate values ($\tilde{C}_t$), which is added to the cell state after being regulated by $i_t$.

$$\tilde{C}_t = \sigma_c(W_c x_t + U_c h_{t-1} + b_c) \tag{3}$$

The input gate and the candidate cell state are then used to update the cell state. The vectors $W_c$ and $U_c$ are the weights for the current input and the previous hidden state, respectively. The values of these vectors are optimized during the training stage.

3) The current cell state is updated by a weighted sum of the previous cell state $C_{t-1}$ and the candidate cell state $\tilde{C}_t$ as follows

$$C_t = f_t \otimes C_{t-1} + i_t \otimes \tilde{C}_t \tag{4}$$

Here, $f_t \in [0,1]$ suppresses the previous cell state $C_{t-1}$ and $i_t \in [0,1]$ scales the update with the candidate cell state $\tilde{C}_t$. The operator $\otimes$ represents the Hadamard product (element-wise product).

4) The output gate decides the next hidden state, which is passed to the next LSTM unit and is also used as the output of the LSTM at each time step. The output gate activation is calculated as

$$o_t = \sigma_g(W_o x_t + U_o h_{t-1} + b_o) \tag{5}$$

where $W_o$ and $U_o$ denote the weight coefficient vectors that are optimized during the training stage. The terms $U_o h_{t-1}$ and $U_c h_{t-1}$ stand for the learned dynamic response from the data sequence, and they are essential components in determining the next hidden state in the form of

$$h_t = o_t \otimes \sigma_c(C_t) \tag{6}$$
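The four steps above (forget gate, input gate with candidate values, cell update, output gate with hidden state) can be sketched as a single time step in plain Python. This is a minimal illustration rather than the authors' implementation: the parameter dictionary layout, the helper names `affine` and `lstm_step`, and the toy weights are assumptions, and weights are stored as lists of rows so that any hidden size works.

```python
import math

def sigmoid(v):
    """Logistic gate activation sigma_g, mapping any real v into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))

def affine(W, x, U, h, b):
    """Return W x + U h + b for list-of-rows matrices W, U and list vectors."""
    return [
        sum(w * xj for w, xj in zip(W_row, x))
        + sum(u * hj for u, hj in zip(U_row, h))
        + bi
        for W_row, U_row, bi in zip(W, U, b)
    ]

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM time step: gates, cell update, hidden state.

    p is a dict with hypothetical keys W_*, U_*, b_* for the forget (f),
    input (i), candidate (c) and output (o) paths.
    """
    f = [sigmoid(v) for v in affine(p["W_f"], x, p["U_f"], h_prev, p["b_f"])]
    i = [sigmoid(v) for v in affine(p["W_i"], x, p["U_i"], h_prev, p["b_i"])]
    c_tilde = [math.tanh(v) for v in affine(p["W_c"], x, p["U_c"], h_prev, p["b_c"])]
    o = [sigmoid(v) for v in affine(p["W_o"], x, p["U_o"], h_prev, p["b_o"])]
    # Cell update: keep part of the old memory, add part of the candidate.
    c = [ft * cp + it * ct for ft, cp, it, ct in zip(f, c_prev, i, c_tilde)]
    # Hidden state: output gate applied to the squashed cell state.
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c

# Toy parameters (assumed): 1 input feature, 2 hidden units, all weights 0.1.
H, D = 2, 1
params = {}
for g in "fico":
    params["W_" + g] = [[0.1] * D for _ in range(H)]
    params["U_" + g] = [[0.1] * H for _ in range(H)]
    params["b_" + g] = [0.0] * H

h, c = [0.0] * H, [0.0] * H
for x_t in [0.2, 0.4, 0.6]:
    h, c = lstm_step([x_t], h, c, params)
print(h, c)
```

Because the output gate lies in (0, 1) and tanh lies in (-1, 1), every component of the hidden state stays strictly inside (-1, 1), whatever the weights are.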

Additive twin LSTM

In nature, data sequences collected from complex systems can involve more than one uncoupled dominant dynamic, each acting almost independently^23–25^, and the measured data contain components from these dynamics. A composition of these dominant dynamic factors establishes the sequential relations in the collected data sequences. For the case of two uncoupled dominant dynamics, the state transitions can be expressed as $d_{t+1,1} = F_1(d_{t,1}, x_{t,1})$ and $d_{t+1,2} = F_2(d_{t,2}, x_{t,2})$, and the measured data sequence can be expressed in the form of

$$y_t = G(d_{t,1}, d_{t,2}) \tag{7}$$

where the function $G(\cdot)$ represents the measurement function, which can be referred to as the decoding function. In multi-component systems, the measurement function (decoding function) can be expressed as a weighted sum of the dynamic factors (see the property of approximation to independent recurrent relations in the Supplementary Material). For data-driven modelling of complex systems with uncoupled multi-dynamic factors, a multi-dynamics modeling approach that accounts for the uncoupled dominant dynamics has the potential to express the sequential relations in data sequences better. To benefit from this, we considered a twin additive LSTM model, where each LSTM can focus on learning one of the uncoupled dominant transition functions ($F_1$, $F_2$) and the neural decoding network learns the measurement function ($G$).
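To make the idea of two uncoupled dynamics with a weighted-sum measurement concrete, the following toy simulation may help. The specific transition functions (a slow accumulation and a damped, periodically driven mode), the weights, and the function name `simulate` are all assumptions chosen for illustration; they are not the climate dynamics studied in the paper.

```python
import math

def simulate(steps, w1=1.0, w2=0.5):
    """Simulate two uncoupled state sequences and their weighted-sum measurement."""
    d1, d2 = 0.0, 1.0                     # initial states (assumed)
    ys = []
    for t in range(steps):
        ys.append(w1 * d1 + w2 * d2)      # measurement: y_t = G(d_t1, d_t2)
        x1 = 0.01                         # constant drive for the slow trend
        x2 = math.cos(0.3 * t)            # periodic drive for the fast mode
        d1 = d1 + x1                      # F_1 depends only on (d1, x1)
        d2 = 0.8 * d2 + 0.2 * x2          # F_2 depends only on (d2, x2)
    return ys

ys = simulate(50)
print(ys[:5])
```

Neither transition reads the other state, so the two sequences evolve independently and only meet in the measurement. A twin-branch model is intended to let each branch approximate one such transition while the decoder approximates the combination.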

In fact, global temperature anomaly measurements involve the impacts of more than one weakly coupled or uncoupled dynamic factor, such as the dynamics associated with global-scale CO_2_ emission and albedo perturbation. Energy budget models indicate that global temperature evolution is driven by multiple uncoupled components of the climate system (e.g., radiative forcing response, ocean–atmosphere adjustments, and slow–fast feedback mechanisms)^26–28^. These models suggest that climate dynamics naturally consist of distinct, partially independent temporal processes. This physical characteristic directly aligns with the structural design of the proposed AT-LSTM architecture, in which two parallel recurrent branches are intended to learn disentangled and uncoupled dynamic modes of the underlying system. For these reasons, the two-LSTM approach is well suited to the global temperature anomaly forecasting problem.

From this perspective, the proposed AT-LSTM implements the addition of two identical LSTMs. Figure 2 shows the basic blocks of the AT-LSTM, which combines two identical LSTM blocks by using an addition block and a neural decoder that yields the output of the model. This implementation provides behavioral and computational benefits. As a behavioral benefit, the two separate LSTMs can focus on learning two uncoupled dynamics that are dominant in the composition of sequential relations in the data sequence. As a computational benefit, the outputs of the two LSTMs are added, and the sum of the two sigmoid functions at the output gates forms a joint activation characteristic. In other words, the sum of two sigmoid functions ($\sigma_g(v) = 1/(1+e^{-v})$) establishes an activation function that can be expressed in the form of a joint activation as

$$\sigma_{sg}(v_1, v_2) = \sigma_g(v_1) + \sigma_g(v_2) = \frac{2 + e^{-v_1} + e^{-v_2}}{(1 + e^{-v_1})(1 + e^{-v_2})} \tag{8}$$

According to Eq. (8), the joint activation function $\sigma_{sg}(v_1, v_2)$ maps the output of the LSTM networks into the range of 0 to 2 in additive form, which expands the range at the input of the neural decoding part.
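The closed form in Eq. (8) follows from putting the two sigmoid terms over a common denominator, and it can be checked numerically. The short sketch below compares the direct sum with the closed form at a few arbitrarily chosen points; the function name `joint_sigmoid` and the sample points are assumptions for illustration.

```python
import math

def sigmoid(v):
    """Logistic function sigma_g(v) = 1 / (1 + exp(-v))."""
    return 1.0 / (1.0 + math.exp(-v))

def joint_sigmoid(v1, v2):
    """Closed form of the summed sigmoids, as written in Eq. (8)."""
    a, b = math.exp(-v1), math.exp(-v2)
    return (2.0 + a + b) / ((1.0 + a) * (1.0 + b))

points = [(-3.0, -3.0), (0.0, 0.0), (2.0, -1.0), (4.0, 4.0)]
results = []
for v1, v2 in points:
    direct = sigmoid(v1) + sigmoid(v2)
    closed = joint_sigmoid(v1, v2)
    results.append((direct, closed))
    print(v1, v2, direct, closed)
```

Each sigmoid lies strictly between 0 and 1, so the sum lies strictly between 0 and 2, which is the range expansion described in the text.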

Fig. 2. Basic blocks of the AT-LSTM.

To form the AT-LSTM network, the outputs of two LSTM networks are summed:

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{gathered} \:o_{t} = o_{{t,1}} + o_{{t,2}} = \sigma \:_{g} \left( {W_{{o,1}} x_{{t,1}} + U_{{o,1}} h_{{t - 1,1}} + b_{{o,1}} } \right) + \sigma \:_{g} \left( {W_{{o,2}} x_{{t,2}} + U_{{o,2}} h_{{t - 1,2}} + b_{{o,2}} } \right) \hfill \\ = \sigma \:_{{sg}} (W_{{o,1}} x_{{t,1}} + U_{{o,1}} h_{{t - 1,1}} + b_{{o,1}} ,\:W_{{o,2}} x_{{t,2}} + U_{{o,2}} h_{{t - 1,2}} + b_{{o,2}} ) \hfill \\ \end{gathered}$$\end{document}

The output gates can be expressed in the form of additive sigmoid activation as follows

$$o_{t}=o_{t,1}+o_{t,2}=\frac{2+e^{-(W_{o,1}x_{t,1}+U_{o,1}h_{t-1,1}+b_{o,1})}+e^{-(W_{o,2}x_{t,2}+U_{o,2}h_{t-1,2}+b_{o,2})}}{\left(1+e^{-(W_{o,1}x_{t,1}+U_{o,1}h_{t-1,1}+b_{o,1})}\right)\left(1+e^{-(W_{o,2}x_{t,2}+U_{o,2}h_{t-1,2}+b_{o,2})}\right)}$$

This section theoretically considers mechanisms that make the AT-LSTM more advantageous in learning sequential relations compared to the conventional LSTM.

A forecast performance evaluation procedure for time series data

RNNs can learn sequential relations through sequential data. To evaluate RNN model performance, it is useful to distinguish the prediction error, which is the error in predicting a data instance in the dataset, from the forecast error, which is the error of a forecasted sequence that does not exist in the dataset. For prediction performance evaluation, one applies an element of the sequence ( $X(n)$ ) from the dataset and predicts the next element ( $\tilde{X}(n+1)$ ) in the dataset. For forecasting performance evaluation, one applies a predicted element ( $\tilde{X}(n)$ ) and predicts the next element ( $\tilde{X}(n+1)$ ). This forms feedback from the output of the RNN to its input and enables consecutive future forecasting over a time horizon. However, this feedback loop causes accumulation of errors in long-term performance evaluation.
Figure 3 depicts the prediction and forecast states of RNN models. Consequently, the prediction performance based on training and test data does not truly express the forecast performance of an RNN for time series data. In fact, the prediction performance mainly expresses how well the model reproduces the dataset by using elements of the dataset, whereas the forecast performance expresses how well it estimates future elements of a time series. For this reason, the future estimation performance of RNNs should be assessed by using the forecast performance. However, assessing future forecasting performance is a complicated problem because future data are not yet available. In this case, the nearest forecast performance (the best forecast performance estimate) for long-term future forecasting can be estimated from the forecast performance of the model on the latest part of the dataset. This part can be the last section of the test dataset, as illustrated in Fig. 4. The test dataset is commonly formed from the last (most recent) part of the time series dataset. Because it is closest to the future, the forecast performance on the test dataset can serve as the nearest estimate of the future forecast performance of RNNs. Figure 4 describes the performance regions on the sequential dataset.

Fig. 3. Prediction state (upper figure) and forecasting state (bottom figure).

Fig. 4. Performance evaluation regions on time-series dataset.

In the long-term forecast process, the forecasted element is used as input in order to predict the next value of the model, and this feedback in the forecasting process causes the accumulation of model prediction errors in the long-term processing. This error accumulation can mislead the forecasting process. Consequently, the forecasting error of RNNs can grow as the forecasting horizon expands into the future, a limitation that has also been emphasized in recent studies on data-driven long-term prediction frameworks^29,30^.
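The difference between the prediction state and the forecast state described in this section can be illustrated with a toy example. This is only a sketch under stated assumptions: the "model" is a hypothetical one-step map that slightly overestimates the growth rate of a synthetic series, and the function names are introduced here for illustration.

```python
from typing import Callable, List

def one_step_predictions(model: Callable[[float], float], data: List[float]) -> List[float]:
    """Prediction state: every input X(n) is taken from the dataset."""
    return [model(x) for x in data[:-1]]

def closed_loop_forecast(model: Callable[[float], float], start: float, horizon: int) -> List[float]:
    """Forecast state: each output is fed back as the next input."""
    out, x = [], start
    for _ in range(horizon):
        x = model(x)
        out.append(x)
    return out

# Hypothetical toy setup: the true series grows by 2% per step,
# while the "learned" model slightly overestimates the growth rate.
truth = [1.02 ** n for n in range(51)]
model = lambda x: 1.021 * x

pred = one_step_predictions(model, truth)
fcst = closed_loop_forecast(model, truth[0], horizon=50)

pred_err = [abs(p - t) for p, t in zip(pred, truth[1:])]
fcst_err = [abs(f - t) for f, t in zip(fcst, truth[1:])]
print(f"step 50: prediction error={pred_err[-1]:.4f}, forecast error={fcst_err[-1]:.4f}")
```

In this toy case the one-step prediction error stays small because every input is corrected by the dataset, whereas the closed-loop forecast error grows with the horizon because each step inherits the error of the previous one.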

Global temperature anomaly forecast

Berkeley’s global temperature anomaly dataset

The temperature anomaly indicates the difference between measured temperatures and the average temperature of a reference period^31^. It is an important observational parameter for climate change studies. Berkeley Temperature Anomaly (Berkeley Earth Surface Temperature, BEST) data^32^ are generally utilized in global warming and climate change studies^22,33–37^. BEST has been widely used for monitoring temperature changes on the Earth’s surface^38^. This dataset provides long-term (1850–2024) temperature anomalies derived from temperature measurements at hundreds of meteorological stations worldwide^32^. It includes monthly, annual, five-year average, ten-year average, and twenty-year average temperature anomaly data. Short-term (monthly and even annual) temperature data usually contain noise and short-term fluctuations. Long-term temperature averages compensate for the effects of measurement noise and short-term fluctuations in the temperature data, and they provide more reliable results that can reveal general climate trends. Analyses of global temperature anomalies should be independent of transient local impacts. Long-term temperature averages better express the dominant trends in climatic change, such as signs of warming or cooling. For this reason, five-year average values were considered in this study, and temperature anomaly data between the 6th month of 1852 and the 3rd month of 2022 were used in our numerical study. This time interval excludes the earliest years because of the sliding window average operation. The five-year averaging improves the reliability of the input sequence used for neural network training. The resulting 170-year time series provides a sufficiently large dataset for training deep learning models and for analyzing long-term climate dynamics.
The complete dataset is publicly accessible through the Berkeley Earth repository, allowing independent verification and replication of all preprocessing procedures.
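To make the effect of the sliding window concrete, the sketch below counts the monthly samples in the selected interval and applies a generic 60-month moving average to a toy series. It is an illustration only: the Berkeley Earth files already supply the five-year averages, and the helper names and the toy values are assumptions introduced here.

```python
from typing import List

def months_inclusive(y0: int, m0: int, y1: int, m1: int) -> int:
    """Number of monthly samples from (y0, m0) to (y1, m1), both included."""
    return (y1 - y0) * 12 + (m1 - m0) + 1

def moving_average(x: List[float], window: int) -> List[float]:
    """Sliding-window mean; returns only positions where the full window fits."""
    if window <= 0 or window > len(x):
        raise ValueError("window must be in 1..len(x)")
    total = sum(x[:window])
    out = [total / window]
    for i in range(window, len(x)):
        total += x[i] - x[i - window]   # slide the window by one sample
        out.append(total / window)
    return out

# Length of the interval used in the study (June 1852 to March 2022).
print(months_inclusive(1852, 6, 2022, 3))  # 2038

# Toy monthly series (hypothetical values) smoothed with a 60-month window.
monthly = [0.01 * k + (0.2 if k % 2 else -0.2) for k in range(120)]
smooth = moving_average(monthly, window=60)
print(len(monthly), len(smooth))  # 120 61
```

A window of length 60 shortens the series by 59 samples, which is why a windowed average cannot cover the very first and last months of the raw record.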

Some benchmark functions for global warming trends

Benchmark functions are commonly used to evaluate the performance of algorithms (e.g., optimization algorithms, machine learning algorithms) in a controlled test environment (without uncertainty, environmental disturbances, measurement noise, etc.). These functions produce synthetic data for specific test cases and allow comparison of algorithms for those cases. The suggested benchmark functions are listed in Table 2. They can characteristically simulate extreme temperature anomaly trends and produce synthetic data to benchmark forecast models.

Table 2. The suggested benchmark functions and parameters.

| Benchmark function | Equation | Parameters |
|---|---|---|
| Exponential function | $y_{1}=e^{an}$ | $a=0.1$ |
| Sinusoidal function with exponential amplitude | $y_{2}=\sin(2\pi fn)\,e^{an}$ | $f=1$, $a=0.1$ |
| Sinusoidal function with exponential offset | $y_{3}=A\sin(2\pi fn)+e^{an}$ | $A=0.1$, $f=1$, $a=0.1$ |
| Nonstationary trend–sinusoidal function with additive noise | $y_{4}=n^{a}+A\sin(2\pi fn)+\epsilon$ | $a=0.05$; $A=0.1$; $\epsilon=0.05\,(\mathrm{rand}(1,\mathrm{length}(n))-0.5)$ |
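The benchmark functions of Table 2 can be generated with a few lines of code. The sketch below is a hedged illustration: the sampling grid of $n$ is not specified in the text, so a uniform grid over $[0, n_{max}]$ with a hypothetical `n_max = 10` is assumed, $f=1$ is assumed for $y_{4}$ because Table 2 does not list it, and the noise term is read as uniform noise scaled by 0.05 and centered at zero.

```python
import math
import random
from typing import Dict, List

def benchmark_series(num_samples: int = 1000, n_max: float = 10.0,
                     seed: int = 0) -> Dict[str, List[float]]:
    """Generate y1..y4 from Table 2 on a uniform grid over [0, n_max].

    n_max and seed are hypothetical; the grid spacing is not given in the text.
    """
    rng = random.Random(seed)
    n = [n_max * k / (num_samples - 1) for k in range(num_samples)]
    a, f, A = 0.1, 1.0, 0.1           # parameters of y1, y2, y3 in Table 2
    a4 = 0.05                         # trend exponent of y4
    y1 = [math.exp(a * t) for t in n]
    y2 = [math.sin(2 * math.pi * f * t) * math.exp(a * t) for t in n]
    y3 = [A * math.sin(2 * math.pi * f * t) + math.exp(a * t) for t in n]
    # Uniform noise in [-0.025, 0.025], i.e. 0.05 * (rand - 0.5).
    eps = [0.05 * (rng.random() - 0.5) for _ in n]
    # f for y4 is not listed in Table 2; f = 1 is assumed here.
    y4 = [t ** a4 + A * math.sin(2 * math.pi * f * t) + e for t, e in zip(n, eps)]
    return {"n": n, "y1": y1, "y2": y2, "y3": y3, "y4": y4}

series = benchmark_series()
print({name: round(values[-1], 4) for name, values in series.items()})
```

With 1000 samples per function, the last 240 samples can then be held out as the test segment in the same way as for the temperature data.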

In this section, to assess the performance of the proposed AT-LSTM model, the prediction performance on both the training and test sets was analyzed for the benchmark functions. Figures 5, 7, 9, and 11 show the predictions of the AT-LSTM models on the training and test datasets for the benchmark functions in Table 2. One can observe that the AT-LSTM model accurately predicts the training and test data and provides satisfactory training and test performance. However, the complication of the long-term forecasting process, namely the consequence of the error accumulation effect, is apparent in Figs. 6, 8, 10, and 12. These figures show the forecasting performance over the test dataset for the benchmark functions; as the forecasting horizon expands, the divergence from the test data becomes more apparent. This divergence clearly indicates a decrease in the consistency of long-term forecasting, and it better reveals the long-term forecast performance of the models. Therefore, the forecasting performance over the test dataset should be considered in order to estimate the practical performance of forecaster models. The major reason for these growing errors is the accumulation of each prediction error in the feedback loop, as illustrated in Fig. 3. In these figures, one observes that the forecast performance is acceptable up to about the first 80 data points of the 240-point forecast horizon. Although the AT-LSTM model exhibits notable performance in predicting the training and test sets, the long-term forecast remains consistent only within a window of about 80 data points of the forecast horizon for these benchmark functions. It should be noted that considering only training and test performance is not sufficient to evaluate the performance of a forecaster model.
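One simple way to quantify how far a forecast stays consistent with the test data is to count the leading steps whose error remains within a tolerance. The text does not state the criterion used to judge acceptability, so the tolerance-based rule, the function name `consistent_horizon`, and the toy error profile below are all assumptions made for illustration.

```python
from typing import List

def consistent_horizon(forecast: List[float], target: List[float], tol: float) -> int:
    """Number of leading forecast steps whose absolute error stays within tol."""
    steps = 0
    for f, t in zip(forecast, target):
        if abs(f - t) > tol:
            break
        steps += 1
    return steps

# Hypothetical 240-step example in which the error grows linearly with the step index.
target = [0.0] * 240
forecast = [k / 1000 for k in range(240)]
print(consistent_horizon(forecast, target, tol=0.0495))
```

Reporting such a horizon next to the usual training and test errors makes the loss of consistency over long horizons explicit.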

Fig. 5. Training and test performances for the benchmark function $y_{1}$.

Fig. 6. Forecast performance over test set for the benchmark function $y_{1}$.

Fig. 7. Training and test performances for the benchmark function $y_{2}$.

Fig. 8. Forecast performance over test set for the benchmark function $y_{2}$.

Fig. 9. Training and test performance for the benchmark function $y_{3}$.

Fig. 10. Forecast performance over test set for the benchmark function $y_{3}$.

Fig. 11. Training and test performance for the benchmark function $y_{4}$.

Fig. 12. Forecast performance over test set for the benchmark function $y_{4}$.

Comparison of the forecasting performance of several LSTM models

This section presents performance data for the AT-LSTM model and other LSTM networks from the literature. The proposed AT-LSTM consists of two independently parameterized LSTM branches. Each branch contains an LSTM layer with 128 hidden units, and no weight sharing is employed. The outputs of the two branches are merged using an addition layer, after which the combined representation is passed to a two-stage fully connected decoder composed of a 10-unit fully connected layer followed by a 1-unit output layer. A standard regression layer is used for loss computation. No dropout layers or explicit L2 regularization are used; stabilization is primarily achieved through gradient thresholding and the learning-rate schedule. Detailed descriptions of the model architectures are provided in the Supplementary Material. The training, test, and forecast performance of these models allows us to discuss the suitability of this model for long-term forecasting of the global temperature anomaly. The parameters and their values used in training are given in Table 3.
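The architecture described above can be summarized compactly as a forward pass. This is a sketch under assumptions rather than the authors' specification: $h_{t,1}$ and $h_{t,2}$ denote the 128-dimensional outputs of the two branches, $W_{d}$, $b_{d}$, $w_{y}$, and $b_{y}$ are hypothetical names for the decoder parameters, and no activation is shown between the two fully connected layers because the text does not specify one.

$$\begin{aligned} h_{t,1} &= \mathrm{LSTM}_{1}(x_{t,1}), \qquad h_{t,2} = \mathrm{LSTM}_{2}(x_{t,2}), \qquad h_{t,1},h_{t,2}\in\mathbb{R}^{128} \\ s_{t} &= h_{t,1}+h_{t,2} \\ z_{t} &= W_{d}\,s_{t}+b_{d}, \qquad W_{d}\in\mathbb{R}^{10\times 128} \\ \hat{y}_{t} &= w_{y}^{\top}z_{t}+b_{y}, \qquad w_{y}\in\mathbb{R}^{10} \end{aligned}$$

Because the two branches are merged by addition rather than concatenation, the decoder input keeps the dimension of a single branch.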

Table 3. Parameters used in training processes.

| Parameter | Value/method |
|---|---|
| Epochs | 500 |
| Initial learning rate | 0.005 |
| Optimizer | Adaptive moment estimation (Adam) |
| Gradient threshold | 1 |
| Learning rate schedule | Piecewise |
| Learning rate drop period | 125 |
| Learning rate drop factor | 0.2 |
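The learning-rate schedule and the gradient threshold in Table 3 can be read as follows. This is a minimal sketch, assuming a 0-based epoch index, a step-wise drop every 125 epochs, and a threshold that acts on the L2 norm of the gradient; the text does not state the thresholding method, and the function names are introduced here for illustration.

```python
def piecewise_learning_rate(epoch: int, initial_lr: float = 0.005,
                            drop_period: int = 125, drop_factor: float = 0.2) -> float:
    """Learning rate for a 0-based epoch index under a piecewise (step) schedule."""
    return initial_lr * drop_factor ** (epoch // drop_period)

def clip_by_norm(grad, threshold: float = 1.0):
    """Rescale a gradient vector so that its L2 norm does not exceed threshold.

    L2-norm clipping is an assumption; the text only gives the threshold value.
    """
    norm = sum(g * g for g in grad) ** 0.5
    if norm <= threshold:
        return list(grad)
    return [g * threshold / norm for g in grad]

for epoch in (0, 124, 125, 250, 375, 499):
    print(epoch, piecewise_learning_rate(epoch))
print(clip_by_norm([3.0, 4.0]))
```

Under these assumptions the learning rate is reduced three times over the 500 epochs, ending at $0.005\times 0.2^{3}=4\times 10^{-5}$.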

Training performance evaluation

Synthetic and real datasets were utilized for rigorous performance evaluation. For the synthetic data, 1000 sequential samples of each benchmark function were used. For the real data, the five-year average temperature anomaly data of the Berkeley Temperature Anomaly dataset, consisting of 2038 data points, were used. For all datasets, the last 240 data samples are used as the test set. The remaining part is allocated to the training set and used for training the models. The training process was repeated 10 times for each model architecture, and the average values of the training performance indices were calculated for both the benchmark functions ( $y_{1}$ , $y_{2}$ , $y_{3}$ ) and the temperature anomaly. Results are reported in Tables 4, 5 and 6. One observes that some models can provide better performance than the proposed AT-LSTM model. However, it should be noted that the training performance does not truly express the long-term forecasting performance of models. Rather, it indicates the data reproduction performance of models over the training dataset.
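The data split and the averaging of performance indices described above can be sketched as follows. The helper names are introduced here for illustration, and the RMSE values in the last line are hypothetical placeholders, not results from the study.

```python
import math
from typing import List, Sequence, Tuple

def split_last(series: Sequence[float], test_size: int = 240) -> Tuple[List[float], List[float]]:
    """Hold out the most recent test_size samples as the test set."""
    if not 0 < test_size < len(series):
        raise ValueError("test_size must be between 1 and len(series) - 1")
    return list(series[:-test_size]), list(series[-test_size:])

def rmse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Root mean square error between two equally long sequences."""
    if len(pred) != len(target):
        raise ValueError("sequences must have the same length")
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))

def mean(values: Sequence[float]) -> float:
    """Arithmetic mean, used to average an index over repeated training runs."""
    return sum(values) / len(values)

# Sizes implied by the text: 2038 real samples and 1000 synthetic samples.
train, test = split_last(list(range(2038)))
print(len(train), len(test))        # 1798 240
train_s, test_s = split_last(list(range(1000)))
print(len(train_s), len(test_s))    # 760 240

# Averaging an index over repeated runs (hypothetical RMSE values for 3 runs).
print(mean([0.012, 0.010, 0.014]))
```

Splitting by position rather than at random keeps the test segment at the most recent end of the series, which is what allows it to serve as the nearest estimate of future forecast performance.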

Table 4. Average of root mean square error (RMSE) for training performance. Architecture $\boldsymbol{y}_{1}$