Adamas: Hadamard Sparse Attention for Efficient Long-Context Inference

Siyuan Yan; Guo-Qing Jiang; Yuchen Zhang; Xiaoxing Ma; Ran Zhu; Chun Cao; Jingwei Xu

arXiv:2510.18413·cs.CL·October 22, 2025

Open Access · 3 Reviews

TL;DR

Adamas introduces a novel sparse attention mechanism using Hadamard transforms and efficient selection techniques, enabling accurate long-context inference in language models with significantly reduced computational costs.

Contribution

The paper proposes a sparse attention method that maintains accuracy at higher sparsity levels and speeds up inference on long sequences, outperforming existing approaches.

Findings

1. Matches full-attention accuracy with a budget of only 64 tokens
2. Achieves near-lossless performance at 128 tokens
3. Supports up to 8x higher sparsity with substantial speedups

Abstract

Large language models (LLMs) now support context windows of hundreds of thousands to millions of tokens, enabling applications such as long-document summarization, large-scale code synthesis, multi-document question answering and persistent multi-turn dialogue. However, such extended contexts exacerbate the quadratic cost of self-attention, leading to severe latency in autoregressive decoding. Existing sparse attention methods alleviate these costs but rely on heuristic patterns that struggle to recall critical key-value (KV) pairs for each query, resulting in accuracy degradation. We introduce Adamas, a lightweight yet highly accurate sparse attention mechanism designed for long-context inference. Adamas applies the Hadamard transform, bucketization and 2-bit compression to produce compact representations, and leverages Manhattan-distance estimation for efficient top-k selections.…
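The selection pipeline named in the abstract (Hadamard transform, bucketization, 2-bit codes, Manhattan-distance top-k) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the bucket thresholds, the function names (`hadamard`, `bucketize`, `select_top_k`, `sparse_attention`), and the rule that a smaller L1 distance between codes means a more relevant key are all assumptions made for illustration, and the code stores one code per byte rather than packing four 2-bit codes together.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Orthonormal Sylvester-Hadamard matrix; n must be a power of two."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    H = np.ones((1, 1))
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # scaling makes H @ H.T the identity

def bucketize(x: np.ndarray, thresholds=(-0.5, 0.0, 0.5)) -> np.ndarray:
    """Map each value to one of four buckets, i.e. a 2-bit code in 0..3.
    The thresholds are illustrative assumptions, not values from the paper."""
    return np.digitize(x, thresholds).astype(np.uint8)

def select_top_k(q: np.ndarray, K: np.ndarray, k: int, H: np.ndarray) -> np.ndarray:
    """Indices of the k keys whose codes are closest to the query's code
    in Manhattan (L1) distance. Assumes smaller distance = more relevant."""
    q_code = bucketize(q @ H).astype(np.int16)  # widen to avoid uint8 wrap-around
    K_code = bucketize(K @ H).astype(np.int16)
    dist = np.abs(K_code - q_code).sum(axis=1)  # integer-only screening
    return np.argsort(dist, kind="stable")[:k]

def sparse_attention(q: np.ndarray, K: np.ndarray, V: np.ndarray, k: int):
    """Exact softmax attention for one query, restricted to the selected keys."""
    d = q.shape[-1]
    idx = select_top_k(q, K, k, hadamard(d))
    scores = (K[idx] @ q) / np.sqrt(d)
    w = np.exp(scores - scores.max())  # numerically stable softmax
    w /= w.sum()
    return w @ V[idx], idx
```

Calling `sparse_attention(q, K, V, k=64)` with a query of shape `(d,)`, keys of shape `(n, d)`, and values of shape `(n, d_v)` would attend to 64 of the `n` cached tokens. In a real decoder the key codes would be computed once when each token enters the KV cache, so only the query is transformed at each decoding step.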

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- The pipeline couples an orthogonal transform that smooths activations with low-bit quantization, enabling efficient integer-only L1 screening before exact attention. The identity \((QH)(KH)^\top = QK^\top\) justifies operating in the Hadamard basis without loss.
- Token-level dynamic selection: Unlike Quest's page-level granularity, Adamas selects at the token level, which plausibly improves recall under tight budgets. The LongBench curves show the smallest gap to full attention at low budget
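The identity cited in the first strength follows in one step from orthogonality. As a short derivation, assuming \(H\) is the Hadamard matrix scaled by \(1/\sqrt{d}\) so that \(HH^\top = I\):

\[
(QH)(KH)^\top = Q\,H H^\top K^\top = Q\,I\,K^\top = QK^\top .
\]

With the unscaled \(\pm 1\) Hadamard matrix, \(HH^\top = dI\) and the product becomes \(d\,QK^\top\), which preserves the ranking of attention scores but not their values.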

Weaknesses

- The paper does not explicitly describe how Adamas operates during the prefill stage or whether the proposed sparse attention strategy is applied there. If the sparse selection is only used during decoding while the prefill phase still relies on full dense attention, then evaluations on datasets like LongBench—whose inputs involve long prefill sequences—may not fully reflect the accuracy implications of the sparse mechanism. Clarifying whether Adamas affects both prefill and decoding phases (or

Reviewer 02 · Rating 2 · Confidence 4

Strengths

- The method shows consistent improvements over baselines across multiple benchmarks with good speedups.
- The paper includes perplexity, accuracy, and efficiency metrics, plus ablation studies validating each component.
- Custom CUDA kernels demonstrate real-world feasibility with actual latency measurements.

Weaknesses

- (Contribution) The core idea of using Hadamard transforms to smooth distributions before quantization is borrowed directly from QuaRot. The novelty claim is therefore quite weak.
- (Method) "Bucketization" looks a lot like standard quantization.
- (Baselines) The comparison is limited to StreamingLLM, a basic sliding-window method that performs poorly on retrieval tasks by design, and Quest, which performs page-level selection with known coarse-grained limitations. A lot of Sparse Atte

Reviewer 03 · Rating 8 · Confidence 4

Strengths

1. Integrating existing methods into an elegant solution yields better results among existing training-free methods.
2. Using the Hadamard transform for similarity estimation and sparse selection is a good idea; it achieves better results at low computational cost.
3. The paper is clearly structured and well written. The authors provide a thorough explanation of the research motivation, methodology, and experimental results.

Weaknesses

1. The kernel analysis in the efficiency section could be more detailed. Apart from full attention, there is no efficiency comparison with other sparse methods.
2. Beyond the experimental comparison, the paper would benefit from explaining why the method matches full attention, for example through case analysis.

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning in Healthcare