Recurrent Memory-Augmented Transformers with Chunked Attention for Long-Context Language Modeling

Ankit Kashyap

arXiv:2507.00453 · cs.LG · July 2, 2025

Open Access

TL;DR

This paper introduces a Transformer architecture that combines global attention with chunked local attention and a gated FIFO memory, so that long contexts can be modeled without attention cost growing quadratically.

Contribution

It proposes a unified attention block that pairs global attention with two biologically inspired components, chunked local attention and a gated FIFO memory, enabling long-range dependency modeling without quadratic attention cost.
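
As a rough illustration of why chunking avoids quadratic cost, the sketch below counts the query-key pairs that are scored when causal attention is restricted to fixed-size chunks, and compares that with full causal attention. The chunking scheme (non-overlapping chunks, causal within each chunk) and both function names are assumptions made for illustration; this summary does not specify how the paper forms its chunks, and the count ignores the global-attention and memory paths.

```python
def chunked_causal_pairs(seq_len: int, chunk_size: int) -> int:
    """Count (query, key) pairs scored when each token attends only to
    positions at or before itself inside its own chunk.

    Assumption: non-overlapping chunks of `chunk_size` tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    pairs = 0
    for q in range(seq_len):
        chunk_start = (q // chunk_size) * chunk_size
        # Keys chunk_start..q inclusive are visible to query q.
        pairs += q - chunk_start + 1
    return pairs


def full_causal_pairs(seq_len: int) -> int:
    """Count (query, key) pairs under ordinary full causal attention."""
    return seq_len * (seq_len + 1) // 2


if __name__ == "__main__":
    for n in (1024, 2048, 4096):
        print(n, chunked_causal_pairs(n, 128), full_causal_pairs(n))
```

With the chunk size held fixed, the chunked count grows in proportion to sequence length, while the full causal count grows with its square.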

Findings

01 Efficient handling of long-context dependencies in language modeling.

02 Implementation of a lightweight, modular Transformer architecture from scratch.

03 Versatility demonstrated across dialogue, code, and document tasks.

Abstract

We present a Transformer architecture for long-context language modeling that combines global attention with two biologically inspired components: chunked local attention and a gated FIFO memory mechanism. This unified attention block allows the model to efficiently handle both short-range and long-range dependencies without increasing attention cost quadratically. The memory module persistently stores past token representations using a gated update mechanism inspired by recurrent networks. Rotary positional encoding is applied per attention head to enable directionally disentangled, scale-invariant positional signals. The architecture is implemented entirely from scratch in PyTorch, with no reliance on high-level libraries, enabling transparent and modular experimentation. Our model offers a lightweight and extensible design for tasks such as dialogue modeling, code completion, and…
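
To make the gated FIFO memory idea concrete, here is a minimal sketch in plain Python (standard library only, not the paper's PyTorch code). It assumes a fixed number of slots, an interpolating gate of the form g * candidate + (1 - g) * previous, and a gate logit passed in by the caller; in a trained model that logit would presumably come from a learned projection. The class name `GatedFIFOMemory`, its methods, and the exact gating rule are hypothetical, since the abstract does not give the update equations.

```python
import math
from collections import deque
from typing import Deque, List


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


class GatedFIFOMemory:
    """Hypothetical fixed-capacity memory: first in, first out, gated writes."""

    def __init__(self, num_slots: int) -> None:
        # A deque with maxlen drops its oldest item when a new one is appended.
        self.slots: Deque[List[float]] = deque(maxlen=num_slots)

    def write(self, candidate: List[float], gate_logit: float) -> List[float]:
        """Blend `candidate` with the newest slot, then push the result.

        Assumed rule: entry = g * candidate + (1 - g) * newest,
        with g = sigmoid(gate_logit). The first write is stored unchanged.
        """
        if self.slots:
            g = sigmoid(gate_logit)
            newest = self.slots[-1]
            entry = [g * c + (1.0 - g) * m for c, m in zip(candidate, newest)]
        else:
            entry = list(candidate)
        self.slots.append(entry)  # evicts the oldest slot when full
        return entry

    def read(self) -> List[List[float]]:
        """Return a copy of the stored slots, oldest first."""
        return [list(s) for s in self.slots]


if __name__ == "__main__":
    memory = GatedFIFOMemory(num_slots=2)
    memory.write([1.0, 0.0], gate_logit=0.0)
    memory.write([0.0, 1.0], gate_logit=0.0)
    memory.write([2.0, 2.0], gate_logit=4.0)
    print(memory.read())
```

An attention layer could then treat the slots returned by `read()` as extra keys and values alongside the current chunk; that wiring is likewise an assumption rather than something stated in the abstract.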

Taxonomy

Topics: Speech and dialogue systems · Topic Modeling · Multimodal Machine Learning Applications