Associative Recurrent Memory Transformer
Ivan Rodkin, Yuri Kuratov, Aydar Bulatov, Mikhail Burtsev

TL;DR
This paper introduces ARMT, a neural architecture combining transformer self-attention and recurrence to efficiently handle very long sequences, achieving state-of-the-art results in long-context associative retrieval tasks.
Contribution
The paper presents ARMT, a novel model that integrates local self-attention with segment-level recurrence for improved long-sequence processing.
Findings
ARMT outperforms existing models in associative retrieval tasks.
Achieves 79.9% accuracy on the BABILong benchmark.
Demonstrates efficient processing of over 50 million tokens.
Abstract
This paper addresses the challenge of creating a neural architecture for very long sequences that requires constant time for processing new information at each time step. Our approach, Associative Recurrent Memory Transformer (ARMT), is based on transformer self-attention for local context and segment-level recurrence for storage of task specific information distributed over a long context. We demonstrate that ARMT outperfors existing alternatives in associative retrieval tasks and sets a new performance record in the recent BABILong multi-task long-context benchmark by answering single-fact questions over 50 million tokens with an accuracy of 79.9%. The source code for training and evaluation is available on github.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsNetwork Packet Processing and Optimization
MethodsAttention Is All You Need · Linear Layer · Multi-Head Attention · Softmax · Byte Pair Encoding · Layer Normalization · Label Smoothing · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam
