Associative Recurrent Memory Transformer

Ivan Rodkin; Yuri Kuratov; Aydar Bulatov; Mikhail Burtsev

arXiv:2407.04841 · cs.CL · February 17, 2025

TL;DR

This paper introduces ARMT, a neural architecture combining transformer self-attention and recurrence to efficiently handle very long sequences, achieving state-of-the-art results in long-context associative retrieval tasks.

Contribution

The paper presents ARMT, a novel model that integrates local self-attention with segment-level recurrence for improved long-sequence processing.

Findings

01 ARMT outperforms existing alternatives in associative retrieval tasks.

02 ARMT sets a new performance record on the BABILong multi-task long-context benchmark, answering single-fact questions with 79.9% accuracy.

03 That result is reported over contexts of 50 million tokens.

Abstract

This paper addresses the challenge of creating a neural architecture for very long sequences that requires constant time for processing new information at each time step. Our approach, Associative Recurrent Memory Transformer (ARMT), is based on transformer self-attention for local context and segment-level recurrence for storage of task-specific information distributed over a long context. We demonstrate that ARMT outperforms existing alternatives in associative retrieval tasks and sets a new performance record in the recent BABILong multi-task long-context benchmark by answering single-fact questions over 50 million tokens with an accuracy of 79.9%. The source code for training and evaluation is available on GitHub.
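To make the idea of segment-level recurrence with a fixed-size memory more concrete, here is a toy sketch in plain Python. It is not the authors' implementation: it uses a classic linear (outer-product) associative memory as a stand-in, and every name in it (`write`, `read`, `process_stream`, `segment_len`) is hypothetical. The keys and values are passed in directly, whereas in a real model they would presumably come from the transformer's self-attention over each segment. The point illustrated is only that the memory has a fixed size, so the cost of absorbing a new segment does not grow with the amount of context already seen.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]
Pair = Tuple[Sequence[float], Sequence[float]]  # (key, value)


def zeros(rows: int, cols: int) -> Matrix:
    """Return a rows x cols matrix of zeros."""
    return [[0.0] * cols for _ in range(rows)]


def write(memory: Matrix, key: Sequence[float], value: Sequence[float]) -> None:
    """Store a (key, value) pair by adding the outer product value * key^T in place."""
    for i, v in enumerate(value):
        row = memory[i]
        for j, k in enumerate(key):
            row[j] += v * k


def read(memory: Matrix, key: Sequence[float]) -> Vector:
    """Retrieve the value associated with `key` as the matrix-vector product."""
    return [sum(m * k for m, k in zip(row, key)) for row in memory]


def process_stream(pairs: Sequence[Pair], segment_len: int,
                   key_dim: int, value_dim: int) -> Matrix:
    """Consume a long stream segment by segment, carrying only `memory` forward.

    The memory stays value_dim x key_dim no matter how many segments are seen,
    so the work per segment depends on segment_len, not on the stream length.
    """
    memory = zeros(value_dim, key_dim)
    for start in range(0, len(pairs), segment_len):
        segment = pairs[start:start + segment_len]
        # Assumption: a real model would run local self-attention over the
        # segment here and derive keys/values from it. This toy skips that.
        for key, value in segment:
            write(memory, key, value)
    return memory


if __name__ == "__main__":
    # Orthonormal (one-hot) keys, so retrieval is exact in this toy setting.
    keys = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
    values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    memory = process_stream(list(zip(keys, values)), segment_len=2,
                            key_dim=4, value_dim=2)
    print(read(memory, keys[1]))
```

With one-hot keys the stored values come back exactly. With non-orthogonal keys a plain outer-product memory suffers interference between stored pairs, which is one reason practical designs use more elaborate update rules than this sketch.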

Code & Models

Repositories

RodkinIvan/associative-recurrent-memory-transformer
PyTorch · Official

Taxonomy

Topics: Network Packet Processing and Optimization

Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Softmax · Byte Pair Encoding · Layer Normalization · Label Smoothing · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam