Hamming Attention Distillation: Binarizing Keys and Queries for Efficient Long-Context Transformers

Mark Horton; Tergel Molom-Ochir; Peter Liu; Bhavna Gopal; Chiyue Wei; Cong Guo; Brady Taylor; Deliang Fan; Shan X. Wang; Hai Li; and Yiran Chen

arXiv:2502.01770·cs.LG·February 5, 2025

Open Access

TL;DR

Hamming Attention Distillation (HAD) binarizes keys and queries in transformer models to drastically improve efficiency for long-context tasks while maintaining high accuracy, enabling practical deployment on custom hardware.

Contribution

HAD introduces a novel binarization and sparsification framework for attention mechanisms, significantly reducing computational costs and hardware resources with minimal accuracy loss.

Findings

01. Incurs only a 1.78% performance loss on GLUE, outperforming previous binarization methods.

02. Reduces hardware area by 79% and power consumption by 87%.

03. Maintains high accuracy on the ImageNet and QuALITY benchmarks.

Abstract

Pre-trained transformer models with extended context windows are notoriously expensive to run at scale, often limiting real-world deployment due to their high computational and memory requirements. In this paper, we introduce Hamming Attention Distillation (HAD), a novel framework that binarizes keys and queries in the attention mechanism to achieve significant efficiency gains. By converting keys and queries into {-1, +1} vectors and replacing dot-product operations with efficient Hamming distance computations, our method drastically reduces computational overhead. Additionally, we incorporate attention matrix sparsification to prune low-impact activations, which further reduces the cost of processing long-context sequences.

Despite these aggressive compression strategies, our distilled approach preserves a high degree of representational power, leading to substantially improved…
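The link between dot products and Hamming distance mentioned in the abstract can be illustrated with a small sketch. For two vectors in {-1, +1}^d, the dot product equals d minus twice the number of positions where they differ, so a packed XOR plus a bit count is enough to recover the attention score. The code below is only an illustration under stated assumptions: sign-based binarization, the top-n pruning rule, and all function names (`binarize`, `pack_bits`, `binary_dot`, `top_n_indices`) are hypothetical choices made here, not the paper's actual procedure, and the distillation training step is not shown.

```python
def binarize(vec):
    """Map each real component to +1 or -1 by sign.
    Assumption: zero maps to +1; the paper's exact rule is not given here."""
    return [1 if x >= 0 else -1 for x in vec]


def pack_bits(signs):
    """Pack a {-1, +1} vector into an int, one bit per component (+1 -> 1, -1 -> 0)."""
    word = 0
    for i, s in enumerate(signs):
        if s == 1:
            word |= 1 << i
    return word


def hamming_distance(a, b):
    """Number of bit positions where the packed vectors a and b differ."""
    return bin(a ^ b).count("1")


def binary_dot(a, b, dim):
    """Dot product of two packed {-1, +1} vectors: dim - 2 * Hamming distance."""
    return dim - 2 * hamming_distance(a, b)


def top_n_indices(scores, n):
    """Hypothetical sparsification rule: keep the indices of the n largest scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:n])


# Toy example: one query against three keys, head dimension 4.
query = [0.3, -1.2, 0.8, -0.1]
keys = [
    [0.5, -0.7, 0.2, -0.9],
    [-0.4, 0.6, -0.3, 0.1],
    [0.9, 0.2, 0.4, -0.5],
]
dim = len(query)

q_signs = binarize(query)
k_signs = [binarize(k) for k in keys]

# Scores computed through XOR and bit counting.
scores = [binary_dot(pack_bits(q_signs), pack_bits(k), dim) for k in k_signs]

# Reference scores computed with an ordinary dot product on the sign vectors.
reference = [sum(q * k for q, k in zip(q_signs, ks)) for ks in k_signs]

# Keep only the two highest-scoring keys for this query.
kept = top_n_indices(scores, 2)
```

In this toy case the Hamming-based scores match the ordinary dot products of the sign vectors exactly, which is why the substitution loses nothing beyond the binarization itself. On real hardware the bit count would typically be a single population-count instruction over machine words rather than a Python string operation.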

Taxonomy

Topics: Advanced Memory and Neural Computing · Parallel Computing and Optimization Techniques · Ferroelectric and Negative Capacitance Devices

Methods: Softmax · Attention Is All You Need