Hamming Attention Distillation: Binarizing Keys and Queries for Efficient Long-Context Transformers
Mark Horton, Tergel Molom-Ochir, Peter Liu, Bhavna Gopal, Chiyue Wei, Cong Guo, Brady Taylor, Deliang Fan, Shan X. Wang, Hai Li, and Yiran Chen

TL;DR
Hamming Attention Distillation (HAD) binarizes keys and queries in transformer models to drastically improve efficiency for long-context tasks while maintaining high accuracy, enabling practical deployment on custom hardware.
Contribution
HAD introduces a novel binarization and sparsification framework for attention mechanisms, significantly reducing computational costs and hardware resources with minimal accuracy loss.
Findings
Incurs only a 1.78% performance loss on GLUE, outperforming previous binarization methods.
Reduces hardware area by 79% and power consumption by 87%.
Maintains high accuracy on ImageNet and QuALITY benchmarks.
Abstract
Pre-trained transformer models with extended context windows are notoriously expensive to run at scale, often limiting real-world deployment due to their high computational and memory requirements. In this paper, we introduce Hamming Attention Distillation (HAD), a novel framework that binarizes keys and queries in the attention mechanism to achieve significant efficiency gains. By converting keys and queries into {-1, +1} vectors and replacing dot-product operations with efficient Hamming distance computations, our method drastically reduces computational overhead. Additionally, we incorporate attention matrix sparsification to prune low-impact activations, which further reduces the cost of processing long-context sequences. Despite these aggressive compression strategies, our distilled approach preserves a high degree of representational power, leading to substantially improved…
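The link between {-1, +1} dot products and Hamming distance mentioned in the abstract can be illustrated with a small sketch. For two vectors of dimension d with entries in {-1, +1}, the dot product equals d minus twice the number of positions where they differ. The code below is only an illustration under stated assumptions: sign-based binarization, the bit-packing layout, and the keep-the-top-n sparsification rule are all hypothetical choices made here, and the helper names are not taken from the paper. It does not cover the distillation training that HAD uses to preserve accuracy.

```python
# Illustrative sketch only: function names, sign binarization, and the
# top-n pruning rule are assumptions, not the paper's implementation.

def binarize(vec):
    """Map a real-valued vector to {-1, +1} by sign (zero maps to +1)."""
    return [1 if x >= 0 else -1 for x in vec]


def pack_bits(signs):
    """Pack a {-1, +1} vector into an int; bit i is set where signs[i] == +1."""
    word = 0
    for i, s in enumerate(signs):
        if s == 1:
            word |= 1 << i
    return word


def hamming_distance(a, b):
    """Count differing bits using XOR followed by a population count."""
    return bin(a ^ b).count("1")


def binary_score(q_bits, k_bits, dim):
    """Recover the {-1, +1} dot product: dim - 2 * Hamming distance."""
    return dim - 2 * hamming_distance(q_bits, k_bits)


def top_n_indices(scores, n):
    """Indices of the n largest scores (assumed sparsification rule)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:n]


# Toy query and keys with made-up values.
query = [0.7, -1.2, 0.1, -0.4]
keys = [
    [0.5, -0.3, 0.9, -2.0],
    [-0.6, 0.8, -0.1, 0.2],
    [1.0, 1.0, -1.0, -1.0],
]
dim = len(query)

q_signs = binarize(query)
key_signs = [binarize(k) for k in keys]

# Scores computed with XOR + popcount on packed bits.
q_bits = pack_bits(q_signs)
scores = [binary_score(q_bits, pack_bits(k), dim) for k in key_signs]

# Reference scores computed as ordinary dot products of the sign vectors.
reference = [sum(a * b for a, b in zip(q_signs, k)) for k in key_signs]

# Keep only the two highest-scoring keys for this query.
kept = top_n_indices(scores, 2)

print(scores, reference, kept)
```

The packed-integer form is what makes the Hamming formulation attractive in hardware: each score needs one XOR and one population count instead of d multiply-accumulate operations.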
Taxonomy
Topics: Advanced Memory and Neural Computing · Parallel Computing and Optimization Techniques · Ferroelectric and Negative Capacitance Devices
Methods: Softmax · Attention Is All You Need
