HMT: Hierarchical Memory Transformer for Efficient Long Context Language Processing
Zifan He, Yingqi Cao, Zongyue Qin, Neha Prakriya, Yizhou Sun, Jason Cong

TL;DR
HMT introduces a hierarchical memory architecture for transformers that mimics human memory, significantly enhancing long-context processing and reducing resource requirements across various language tasks.
Contribution
The paper proposes the Hierarchical Memory Transformer (HMT), a novel memory-augmented framework that improves long-context understanding by imitating human memory hierarchy.
Findings
HMT outperforms previous models in long-context tasks.
HMT achieves comparable or better quality with fewer parameters.
HMT reduces inference memory significantly.
Abstract
Transformer-based large language models (LLMs) have been widely used in language processing applications. However, due to the memory constraints of the devices, most of them restrict the context window. Even though recurrent models in previous works can memorize past tokens to enable unlimited context and maintain effectiveness, they have "flat" memory architectures. Such architectures have limitations in selecting and filtering information. Since humans are good at learning and self-adjustment, we believe that imitating brain memory hierarchy is beneficial for model memorization. Thus, we propose the Hierarchical Memory Transformer (HMT), a novel framework that facilitates a model's long-context processing ability by imitating human memorization behavior. Leveraging memory-augmented segment-level recurrence, we organize the memory hierarchy by preserving tokens from early input…
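The abstract mentions segment-level recurrence without spelling out its mechanics here. The following is a minimal, generic sketch of that idea only, not the authors' implementation: a long token sequence is split into fixed-size segments, and a small memory value is carried from one segment to the next so that each step only touches one segment at a time. The names `split_into_segments`, `run_segment_recurrence`, and `toy_step` are hypothetical, and `toy_step` is an arithmetic stand-in for a real model forward pass.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical type: a step takes (segment, memory) and returns (output, new_memory).
Step = Callable[[List[int], int], Tuple[int, int]]


def split_into_segments(tokens: Sequence[int], segment_len: int) -> List[List[int]]:
    """Split a token sequence into consecutive segments of at most segment_len."""
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    return [list(tokens[i:i + segment_len]) for i in range(0, len(tokens), segment_len)]


def run_segment_recurrence(
    tokens: Sequence[int],
    segment_len: int,
    step: Step,
    initial_memory: int,
) -> Tuple[List[int], int]:
    """Process segments in order, passing the memory from each step to the next."""
    memory = initial_memory
    outputs = []
    for segment in split_into_segments(tokens, segment_len):
        # Only one segment plus the carried memory is needed at a time.
        output, memory = step(segment, memory)
        outputs.append(output)
    return outputs, memory


def toy_step(segment: List[int], memory: int) -> Tuple[int, int]:
    """Assumed stand-in for a model call: fold the segment sum into the memory."""
    new_memory = memory + sum(segment)
    return new_memory, new_memory


if __name__ == "__main__":
    outputs, final_memory = run_segment_recurrence(list(range(1, 11)), 4, toy_step, 0)
    print(outputs, final_memory)
```

In this sketch the per-step working set is bounded by `segment_len` regardless of total input length, which is the property that makes recurrence attractive under device memory constraints. How HMT builds and selects from its memory hierarchy is not covered by this example.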
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer · Dropout · Label Smoothing · Residual Connection · Softmax · Absolute Position Encodings · Byte Pair Encoding
