Efficiently applying attention to sequential data with the Recurrent Discounted Attention unit
Brendan Maginnis, Pierre H. Richemond

TL;DR
The paper introduces the Recurrent Discounted Attention (RDA) unit, which extends the Recurrent Weighted Average (RWA) unit so that attention assigned to earlier timesteps can be discounted. This lets the unit handle attention that changes over a sequence and improves performance and learning speed on several tasks.
Contribution
The RDA unit extends the RWA by discounting past attention, so the weight given to earlier timesteps can be adjusted as the sequence progresses.
Findings
RDA learns faster than LSTM and GRU on single-output tasks.
RDA outperforms RWA and other units on multiple sequence copy tasks.
RDA performs competitively with LSTM on Wikipedia character prediction.
Abstract
Recurrent Neural Network architectures excel at processing sequences by modelling dependencies over different timescales. The recently introduced Recurrent Weighted Average (RWA) unit captures long-term dependencies far better than an LSTM on several challenging tasks. The RWA achieves this by applying attention to each input and computing a weighted average over the full history of its computations. Unfortunately, the RWA cannot change the attention it has assigned to previous timesteps, and so struggles with carrying out consecutive tasks or tasks with changing requirements. We present the Recurrent Discounted Attention (RDA) unit that builds on the RWA by additionally allowing the discounting of the past. We empirically compare our model to RWA, LSTM and GRU units on several challenging tasks. On tasks with a single output the RWA, RDA and GRU units learn much quicker than the…
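The idea of a running attention-weighted average with discounting of the past can be illustrated with a small sketch. This is a simplified scalar illustration, not the authors' exact formulation: the function name `discounted_attention_average` is hypothetical, and the attention scores and discount factors are passed in directly, whereas in the actual unit they would presumably be produced by learned layers.

```python
import math


def discounted_attention_average(values, scores, discounts):
    """Running attention-weighted average with discounting of the past.

    values:    per-timestep inputs z_t (floats)
    scores:    per-timestep attention scores a_t (floats); weight is exp(a_t)
    discounts: per-timestep factors g_t in [0, 1] applied to the history
               (assumption: supplied by the caller, not learned here)
    Returns the average h_t after each timestep.
    """
    numerator = 0.0    # discounted sum of exp(a_t) * z_t
    denominator = 0.0  # discounted sum of exp(a_t)
    outputs = []
    for z, a, g in zip(values, scores, discounts):
        weight = math.exp(a)  # note: large scores can overflow in this naive form
        # Shrink what has been accumulated so far, then add the new term.
        numerator = g * numerator + weight * z
        denominator = g * denominator + weight
        outputs.append(numerator / denominator)
    return outputs


# Discount 1 everywhere: plain weighted average over the full history.
full_history = discounted_attention_average([1.0, 3.0], [0.0, 0.0], [1.0, 1.0])

# Discount 0 at the second step: the first input is forgotten entirely.
forget_past = discounted_attention_average([1.0, 3.0], [0.0, 0.0], [1.0, 0.0])
```

With every discount fixed at 1 the sketch never revises the weight given to earlier timesteps, which mirrors the limitation the abstract attributes to the RWA. Discounts below 1 shrink the contribution of the past, which is the extra freedom the RDA is described as adding.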
Taxonomy
Topics: Data Stream Mining Techniques · Bayesian Modeling and Causal Inference · Time Series Analysis and Forecasting
Methods: Sigmoid Activation · Tanh Activation · Gated Recurrent Unit · Long Short-Term Memory
