Compact Recurrent Transformer with Persistent Memory
Edison Mucllari, Zachary Daniels, David Zhang, Qiang Ye

TL;DR
This paper introduces a Compact Recurrent Transformer (CRT) that efficiently processes long sequences by combining shallow local transformers with a persistent memory, reducing computational costs while maintaining high performance.
Contribution
The novel CRT model integrates recurrent neural networks with shallow transformers to effectively manage global information with lower computational overhead.
Findings
CRT achieves comparable or better results than full Transformers on language tasks.
CRT uses significantly fewer FLOPs and shorter segments.
CRT achieves state-of-the-art performance on the Toyota Smarthome dataset.
Abstract
The Transformer architecture has shown significant success in many language processing and visual tasks. However, the method faces challenges in efficiently scaling to long sequences because the self-attention computation is quadratic with respect to the input length. To overcome this limitation, several approaches scale to longer sequences by breaking long sequences into a series of segments, restricting self-attention to local dependencies between tokens within each segment and using a memory mechanism to manage information flow between segments. However, these approaches generally introduce additional compute overhead that restricts them from being used for applications where limited compute, memory, and power are of great concern (such as edge computing). We propose a novel and efficient Compact Recurrent Transformer (CRT), which combines shallow Transformer models that process short…
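The segment-plus-memory idea described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the single-head attention layer, the mean-pooled segment summary, the tanh recurrent update, and all names (`init_params`, `run_segments`, `attention_score_count`, `w_mem`, `w_in`) are assumptions chosen only to show how a memory vector might carry information from one short segment to the next, and why segmenting reduces the number of attention scores.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, wq, wk, wv):
    """Single-head self-attention over the rows of x (shape: length x dim)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])  # length x length score matrix
    return softmax(scores) @ v


def init_params(dim, seed=0):
    """Random weights for the illustrative model (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    names = ["wq", "wk", "wv", "w_mem", "w_in"]
    return {n: rng.normal(scale=dim ** -0.5, size=(dim, dim)) for n in names}


def run_segments(tokens, segment_len, params):
    """Process a long sequence segment by segment with one memory vector.

    Assumed design: the memory is prepended to each segment as an extra
    token, attention is restricted to that segment, and the memory is then
    updated by a simple tanh recurrence on the mean of the segment outputs.
    """
    dim = tokens.shape[1]
    memory = np.zeros(dim)
    outputs = []
    for start in range(0, len(tokens), segment_len):
        segment = tokens[start:start + segment_len]
        x = np.vstack([memory[None, :], segment])  # memory acts as token 0
        h = self_attention(x, params["wq"], params["wk"], params["wv"])
        outputs.append(h[1:])                      # drop the memory position
        summary = h[1:].mean(axis=0)
        memory = np.tanh(params["w_mem"] @ memory + params["w_in"] @ summary)
    return np.vstack(outputs), memory


def attention_score_count(n, segment_len=None):
    """Count attention scores: full attention vs. segments plus one memory token."""
    if segment_len is None:
        return n * n
    full, rest = divmod(n, segment_len)
    return full * (segment_len + 1) ** 2 + ((rest + 1) ** 2 if rest else 0)
```

Under these assumptions, a 1024-token input needs 1024 × 1024 = 1,048,576 attention scores with full self-attention, but only 16 × 65 × 65 = 67,600 when split into 64-token segments that each carry one memory token. Later segments can still see earlier context, but only through what the memory vector retains.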
Taxonomy
Topics: Neural Networks and Reservoir Computing · Photonic and Optical Devices · Neural Networks and Applications
Methods: Linear Layer · Multi-Head Attention · Dense Connections · Adam · Attention Is All You Need · Dropout · Layer Normalization · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Softmax
