Memory Transformer

Mikhail S. Burtsev; Yuri Kuratov; Anton Peganov; Grigory V. Sapunov

arXiv:2006.11527·cs.CL·February 17, 2021·6 cites

TL;DR

This paper explores augmenting Transformer models with trainable memory tokens to enhance global context processing, demonstrating improved performance in translation and language modeling tasks, and providing insights into attention mechanisms.

Contribution

The paper introduces novel methods for integrating memory tokens into Transformers, including memory bottlenecks and update controls, to improve global information handling.
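
As a rough illustration of the memory-token idea, the sketch below prepends a few memory vectors to a sequence of token embeddings and runs one self-attention step over the combined sequence. It is a minimal sketch, not the authors' implementation: placing memory at the start of the sequence, the identity query/key/value projections, the random initial values, and the names `self_attention` and `with_memory` are all assumptions made for illustration. In a real model the memory vectors would be trainable parameters.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(seq):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys, and values are the inputs themselves
    (identity projections), to keep the sketch short.
    """
    d = len(seq[0])
    out = []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, seq)) for j in range(d)])
    return out


def with_memory(token_embeddings, memory):
    """Prepend memory vectors (hypothetical layout), attend, then split the result."""
    combined = memory + token_embeddings
    updated = self_attention(combined)
    return updated[:len(memory)], updated[len(memory):]


random.seed(0)
d_model, num_mem, seq_len = 4, 2, 3

# Stand-ins for trainable memory embeddings and for embedded input tokens.
memory = [[random.gauss(0, 0.1) for _ in range(d_model)] for _ in range(num_mem)]
tokens = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(seq_len)]

new_memory, new_tokens = with_memory(tokens, memory)
print(len(new_memory), len(new_tokens))
```

Because every position attends to every other, the memory slots can read from all tokens and the tokens can read from the memory slots. The memory bottleneck and update-control variants mentioned above would restrict or alter this attention pattern and are not shown here.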

Findings

01. Memory-augmented Transformers outperform baseline models in translation and language modeling.

02. Memory tokens enhance the model's ability to process global context.

03. Memory augmentation shows mixed results on GLUE benchmark tasks.

Abstract

Transformer-based models have achieved state-of-the-art results in many natural language processing tasks. The self-attention architecture allows the Transformer to combine information from all elements of a sequence into context-aware representations. However, information about the context is stored mostly in the same element-wise representations. This might make the processing of properties related to the sequence as a whole more difficult. Adding trainable memory to selectively store local as well as global representations of a sequence is a promising direction to improve the Transformer model. Memory-augmented neural networks (MANNs) extend traditional neural architectures with general-purpose memory for representations. MANNs have demonstrated the capability to learn simple algorithms like Copy or Reverse and can be successfully trained via backpropagation on diverse tasks from…

Code & Models

Repositories

lucidrains/x-transformers (PyTorch)

Taxonomy

Topics: Analog and Mixed-Signal Circuit Design · Advanced Memory and Neural Computing · Ferroelectric and Negative Capacitance Devices

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Label Smoothing · Multi-Head Attention · Adam · Dropout · Byte Pair Encoding