HMT: Hierarchical Memory Transformer for Efficient Long Context Language Processing

Zifan He; Yingqi Cao; Zongyue Qin; Neha Prakriya; Yizhou Sun; Jason Cong

arXiv:2405.06067 · cs.CL · February 7, 2025 · 2 cites

TL;DR

HMT introduces a hierarchical memory architecture for transformers that mimics human memory, significantly enhancing long-context processing and reducing resource requirements across various language tasks.

Contribution

The paper proposes the Hierarchical Memory Transformer (HMT), a novel memory-augmented framework that improves long-context understanding by imitating human memory hierarchy.

Findings

01

HMT outperforms previous models in long-context tasks.

02

HMT achieves comparable or better quality with fewer parameters.

03

HMT reduces inference memory significantly.

Abstract

Transformer-based large language models (LLMs) have been widely used in language processing applications. However, due to the memory constraints of the devices, most of them restrict the context window. Even though recurrent models in previous works can memorize past tokens to enable unlimited context and maintain effectiveness, they have "flat" memory architectures. Such architectures have limitations in selecting and filtering information. Since humans are good at learning and self-adjustment, we believe that imitating brain memory hierarchy is beneficial for model memorization. Thus, we propose the Hierarchical Memory Transformer (HMT), a novel framework that facilitates a model's long-context processing ability by imitating human memorization behavior. Leveraging memory-augmented segment-level recurrence, we organize the memory hierarchy by preserving tokens from early input…
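The abstract is truncated here, so the full HMT mechanism is not visible on this page. As a rough illustration of the one idea it does name, segment-level recurrence with a carried memory, the sketch below splits a long input into fixed-length segments and threads a small memory vector through them. Everything in it is an assumption made for illustration: `split_into_segments`, `toy_segment_step`, `run_recurrence`, and the averaging update are hypothetical stand-ins, not the authors' code or the HMT-pytorch API.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def split_into_segments(tokens: Sequence[int], segment_len: int) -> List[List[int]]:
    """Cut a long token sequence into consecutive fixed-length segments."""
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    return [list(tokens[i:i + segment_len]) for i in range(0, len(tokens), segment_len)]


def toy_segment_step(segment: List[int], memory: Vector) -> Vector:
    """Hypothetical stand-in for a model forward pass over one segment.

    A real model would attend over the segment together with the memory
    embedding; here each memory slot is simply averaged with the segment mean.
    """
    segment_mean = sum(segment) / len(segment)
    return [0.5 * m + 0.5 * segment_mean for m in memory]


def run_recurrence(
    tokens: Sequence[int],
    segment_len: int,
    memory_dim: int,
    step: Callable[[List[int], Vector], Vector] = toy_segment_step,
) -> Tuple[Vector, List[Vector]]:
    """Process segments in order, passing the memory from one to the next."""
    memory: Vector = [0.0] * memory_dim
    history: List[Vector] = []
    for segment in split_into_segments(tokens, segment_len):
        memory = step(segment, memory)
        history.append(list(memory))  # one memory snapshot per segment
    return memory, history


if __name__ == "__main__":
    final_memory, snapshots = run_recurrence(list(range(10)), segment_len=4, memory_dim=2)
    print(len(snapshots), final_memory)
```

The point of the pattern is that only one segment plus a fixed-size memory is processed at a time, so the working set depends on the segment length rather than on the full input length. How HMT selects, filters, and organizes what it remembers is not covered by this sketch.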

Code & Models

Repositories

OswaldHe/HMT-pytorch
PyTorch · Official

Models

🤗 OswaldHe123/HMT-Llama3.1-8B-OpenROAD
model

Videos

HMT: Hierarchical Memory Transformer for Efficient Long Context Language Processing · Underline

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer · Dropout · Label Smoothing · Residual Connection · Softmax · Absolute Position Encodings · Byte Pair Encoding