Memory Transformer
Mikhail S. Burtsev, Yuri Kuratov, Anton Peganov, Grigory V. Sapunov

TL;DR
This paper explores augmenting Transformer models with trainable memory tokens to enhance global context processing, demonstrating improved performance in translation and language modeling tasks, and providing insights into attention mechanisms.
Contribution
The paper introduces novel methods for integrating memory tokens into Transformers, including memory bottlenecks and update controls, to improve global information handling.
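One way to picture a memory bottleneck is as an attention mask in which sequence positions may read only from memory slots, so any information exchanged between tokens has to pass through memory. The sketch below is a hypothetical reading of that idea, not the paper's implementation; the function name `bottleneck_mask`, the memory-first layout, and the exact masking rule are assumptions made for illustration.

```python
import numpy as np

def bottleneck_mask(num_mem: int, seq_len: int) -> np.ndarray:
    """Boolean mask where entry [i, j] is True if position i may attend to j.

    Assumed layout: the first `num_mem` positions are memory slots,
    followed by `seq_len` sequence tokens.
    """
    n = num_mem + seq_len
    allowed = np.zeros((n, n), dtype=bool)
    allowed[:num_mem, :] = True   # memory slots read from memory and from the sequence
    allowed[:, :num_mem] = True   # every position may read from memory
    # Token-to-token entries stay False: tokens cannot attend to each other directly.
    return allowed

mask = bottleneck_mask(num_mem=2, seq_len=3)
```

In an attention layer, such a mask would typically be applied by setting the disallowed scores to a large negative value before the softmax.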
Findings
Memory-augmented Transformers outperform baseline models in translation and language modeling.
Memory tokens enhance the model's ability to process global context.
Memory augmentation shows mixed results on GLUE benchmark tasks.
Abstract
Transformer-based models have achieved state-of-the-art results in many natural language processing tasks. The self-attention architecture allows the Transformer to combine information from all elements of a sequence into context-aware representations. However, information about the context is stored mostly in the same element-wise representations. This might make the processing of properties related to the sequence as a whole more difficult. Adding trainable memory to selectively store local as well as global representations of a sequence is a promising direction to improve the Transformer model. Memory-augmented neural networks (MANNs) extend traditional neural architectures with general-purpose memory for representations. MANNs have demonstrated the capability to learn simple algorithms like Copy or Reverse and can be successfully trained via backpropagation on diverse tasks from…
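The idea of adding trainable memory to a Transformer's input can be sketched in a few lines. The example below is a minimal illustration, not the authors' code: it prepends a small block of memory vectors to the token embeddings and runs one single-head self-attention step over the combined sequence, so the memory slots and the tokens can read from each other. The sizes, the random initialization, and the names `memory`, `tokens`, and `self_attention` are assumptions; in a real model the memory vectors and projection matrices would be learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    # Single-head scaled dot-product self-attention over all rows of x.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
d_model, num_mem, seq_len = 8, 4, 6  # illustrative sizes only

# Stand-ins for learned parameters (random here, trained in practice).
memory = rng.normal(size=(num_mem, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

# Stand-in for the embedded input sequence.
tokens = rng.normal(size=(seq_len, d_model))

# Prepend the memory slots so attention runs over memory and tokens together.
x = np.concatenate([memory, tokens], axis=0)
out = self_attention(x, w_q, w_k, w_v)

# Split the result back into updated memory and updated token representations.
mem_out, tok_out = out[:num_mem], out[num_mem:]
```

Because the memory slots are not tied to any input position, they give the model a place to accumulate sequence-level information separately from the per-token representations.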
Taxonomy
Topics: Analog and Mixed-Signal Circuit Design · Advanced Memory and Neural Computing · Ferroelectric and Negative Capacitance Devices
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Label Smoothing · Multi-Head Attention · Adam · Dropout · Byte Pair Encoding
