BitStopper: An Efficient Transformer Attention Accelerator via Stage-fusion and Early Termination

Huizheng Wang; Hongbin Wang; Shaojun Wei; Yang Hu; Shouyi Yin

arXiv:2512.06457 · cs.LG · December 9, 2025

TL;DR

BitStopper accelerates transformer attention by fusing the sparsity-prediction stage into the execution stage and terminating trivial tokens early, which reduces the computation and memory traffic of attention in large language models.

Contribution

The paper presents a hardware-efficient algorithm-architecture co-design that needs no sparsity predictor and improves performance through stage fusion and adaptive token selection.

Findings

01. Achieves over 2x speedup compared to state-of-the-art accelerators.

02. Delivers more than 2x improvements in energy efficiency.

03. Reduces memory access and computation in transformer models.

Abstract

Attention-based large language models (LLMs) have transformed modern AI applications, but the quadratic cost of self-attention imposes significant compute and memory overhead. Dynamic sparsity (DS) attention mitigates this, yet its hardware efficiency is limited by the added prediction stage and the heavy memory traffic it entails. To address these limitations, this paper proposes BitStopper, a fine-grained algorithm-architecture co-design that operates without a sparsity predictor. First, a bit-serial enable stage fusion (BESF) mechanism is proposed to reuse and minimize the memory access by progressively terminating trivial tokens and merging the prediction stage into the execution stage. Second, a lightweight and adaptive token selection (LATS) strategy is developed to work in concert with the bit-level sparsity speculation. Third, a bit-level asynchronous processing (BAP) strategy…
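The idea of terminating trivial tokens progressively while scores are computed bit-serially can be illustrated with a small sketch. This is not the paper's BESF or LATS algorithm: the function name `bit_serial_scores`, the 8-bit two's-complement key format, the bound-based pruning rule, and the `margin` parameter are all assumptions made for illustration. The sketch processes key bit planes from most to least significant, keeps a running partial score per token, and drops a token as soon as its best possible final score falls more than `margin` below the best guaranteed score of any surviving token.

```python
from typing import List, Optional, Tuple


def bit_serial_scores(
    q: List[int], keys: List[List[int]], bits: int = 8, margin: int = 0
) -> Tuple[List[Optional[int]], List[int]]:
    """Hypothetical sketch: bit-serial q.k scores with early termination.

    keys hold signed ints in [-2**(bits-1), 2**(bits-1) - 1] (assumed format).
    Returns (scores, planes_used); a terminated token has score None.
    """
    # Bounds on what one unit of still-unprocessed key weight can add.
    pos_sum = sum(x for x in q if x > 0)
    neg_sum = sum(x for x in q if x < 0)

    n = len(keys)
    partial = [0] * n        # running score from the bit planes seen so far
    alive = [True] * n       # tokens still being processed
    planes_used = [0] * n    # bit planes actually fetched per token

    for b in range(bits - 1, -1, -1):  # most significant bit first
        # In two's complement the top bit carries a negative weight.
        weight = -(1 << b) if b == bits - 1 else (1 << b)
        for t in range(n):
            if not alive[t]:
                continue  # terminated tokens cost no further work
            # Dot product of q with bit plane b of key t.
            plane_dot = sum(qi for qi, ki in zip(q, keys[t]) if (ki >> b) & 1)
            partial[t] += weight * plane_dot
            planes_used[t] += 1

        # Total weight of the lower bits that have not been processed yet.
        remaining = (1 << b) - 1
        best_lower = max(
            partial[t] + remaining * neg_sum for t in range(n) if alive[t]
        )
        for t in range(n):
            upper = partial[t] + remaining * pos_sum
            if alive[t] and upper < best_lower - margin:
                alive[t] = False  # cannot come within margin of the best token

    scores = [partial[t] if alive[t] else None for t in range(n)]
    return scores, planes_used


if __name__ == "__main__":
    q = [3, -2, 1, 4]
    keys = [[100, -50, 20, 90], [-100, 60, -10, -80],
            [90, -40, 30, 100], [5, 3, -2, 1]]
    print(bit_serial_scores(q, keys, bits=8, margin=50))
```

Because prediction and execution share the same partial sums in this sketch, surviving tokens end with their exact integer score and no separate prediction pass is needed, while terminated tokens stop consuming bit planes early. The pruning bound here is deliberately conservative; an adaptive selection rule such as the one the abstract describes would presumably behave differently.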

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Low-power high-performance VLSI design