R-Transformer: Recurrent Neural Network Enhanced Transformer
Zhiwei Wang, Yao Ma, Zitao Liu, Jiliang Tang

TL;DR
The R-Transformer combines features of RNNs and the Transformer to model both local structure and long-term dependencies in sequences without position embeddings, and is reported to outperform state-of-the-art methods on multiple sequence modeling tasks.
Contribution
The paper introduces the R-Transformer, a model that integrates RNNs with multi-head attention to capture both local and global sequence structure without position embeddings.
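One way to picture the combination described above is the following minimal sketch in plain Python. It assumes details that this summary does not state: a tanh RNN run over a fixed-size window ending at each position (the `window` parameter), a single causal attention head standing in for multi-head attention, and a residual connection around the attention step; the feed-forward sublayer is left out. All function names, parameter names, and the weight initialisation are hypothetical and are not taken from the authors' code.

```python
import math
import random


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rand_matrix(rng, rows, cols, scale=0.5):
    """Small random matrix; the initialisation is an arbitrary choice for the demo."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def make_params(d, seed=0):
    """Hypothetical parameter set: RNN weights plus query/key/value projections."""
    rng = random.Random(seed)
    names = ("W_in", "W_h", "W_q", "W_k", "W_v")
    return {name: rand_matrix(rng, d, d) for name in names}


def local_rnn(xs, W_in, W_h, window):
    """For each position t, run a tanh RNN over the last `window` inputs
    ending at t (zero-padded on the left) and keep the final hidden state."""
    d = len(W_h)
    pad = [[0.0] * len(xs[0]) for _ in range(window - 1)]
    padded = pad + xs
    out = []
    for t in range(len(xs)):
        h = [0.0] * d
        for x in padded[t:t + window]:  # covers xs[t - window + 1 .. t]
            a = matvec(W_in, x)
            b = matvec(W_h, h)
            h = [math.tanh(ai + bi) for ai, bi in zip(a, b)]
        out.append(h)
    return out


def causal_attention(hs, W_q, W_k, W_v):
    """Single-head scaled dot-product attention over positions <= t,
    followed by a residual connection (assumed, not stated in the summary)."""
    d = len(W_q)
    qs = [matvec(W_q, h) for h in hs]
    ks = [matvec(W_k, h) for h in hs]
    vs = [matvec(W_v, h) for h in hs]
    out = []
    for t, q in enumerate(qs):
        scores = [sum(a * b for a, b in zip(q, ks[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        m = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        ctx = [sum(e / z * vs[s][i] for s, e in enumerate(exps))
               for i in range(d)]
        out.append([hi + ci for hi, ci in zip(hs[t], ctx)])
    return out


def r_transformer_layer(xs, params, window):
    """Local RNN for short-range structure, then attention for long-range links."""
    hs = local_rnn(xs, params["W_in"], params["W_h"], window)
    return causal_attention(hs, params["W_q"], params["W_k"], params["W_v"])


if __name__ == "__main__":
    rng = random.Random(1)
    seq = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(6)]
    out = r_transformer_layer(seq, make_params(4), window=3)
    print(len(out), len(out[0]))
```

In this sketch no position embedding is added anywhere: the local RNN reads each window in order, so its final hidden state already depends on the order of nearby inputs, and the attention step then mixes those order-aware states across the whole prefix.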
Findings
Outperforms state-of-the-art methods in multiple sequence tasks.
Effectively models local structures and long-term dependencies.
Eliminates the need for position embeddings.
Abstract
Recurrent Neural Networks (RNNs) have long been the dominant choice for sequence modeling. However, they suffer severely from two issues: they are weak at capturing very long-term dependencies, and their sequential computation cannot be parallelized. Therefore, many non-recurrent sequence models built on convolution and attention operations have been proposed recently. Notably, models with multi-head attention, such as the Transformer, have proven highly effective at capturing long-term dependencies in a variety of sequence modeling tasks. Despite their success, however, these models lack the components needed to model local structures in sequences and rely heavily on position embeddings, which have limited effect and require considerable design effort. In this paper, we propose the R-Transformer which enjoys the advantages of both RNNs and the multi-head attention…
Taxonomy
Topics: Neural Networks and Applications · Time Series Analysis and Forecasting · Anomaly Detection Techniques and Applications
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax
