Scaling Stick-Breaking Attention: An Efficient Implementation and In-depth Study
Shawn Tan, Songlin Yang, Aaron Courville, Rameswar Panda, Yikang Shen

TL;DR
This paper introduces a novel stick-breaking attention mechanism as an efficient alternative to softmax-based attention, demonstrating competitive performance and improved length generalization in large-scale models.
Contribution
The paper presents a new attention method based on stick-breaking, including implementation details and adaptation of Flash Attention, showing its effectiveness as a drop-in replacement.
Findings
Performs competitively on length generalization tasks
Enables models trained on smaller contexts to generalize to larger contexts
Achieves perplexity improvements at longer sequence lengths
Abstract
The self-attention mechanism traditionally relies on the softmax operator, necessitating positional embeddings like RoPE, or position biases to account for token order. But current methods still face length generalisation challenges. We investigate an alternative attention mechanism based on the stick-breaking process in larger scale settings. The method works as follows: For each token before the current one, we determine a break point, which represents the proportion of the stick, the weight of the attention, to allocate to the current token. We repeat this on the remaining stick, until all tokens are allocated a weight, resulting in a sequence of attention weights. This process naturally incorporates recency bias, which has linguistic motivations for grammar parsing. We study the implications of replacing the conventional softmax-based attention mechanism with stick-breaking…
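The allocation procedure described in the abstract can be sketched for a single query as follows. This is a minimal illustration, not the paper's implementation: the function name `stick_breaking_weights`, the use of a sigmoid to turn each logit into a break point, and the choice to start from the most recent token are assumptions made for this example, and nothing here reflects the Triton or Flash Attention kernels mentioned in the reviews.

```python
import math


def stick_breaking_weights(logits):
    """Allocate attention over preceding tokens by repeatedly breaking a stick.

    Assumed convention for this sketch: logits[0] belongs to the oldest token
    and logits[-1] to the token immediately before the current one.
    Returns (weights, remaining), where remaining is the unallocated stick.
    """
    weights = [0.0] * len(logits)
    remaining = 1.0  # the whole stick is available at the start

    # Walk backwards from the most recent token (assumed order).
    for i in range(len(logits) - 1, -1, -1):
        # Assumption: a sigmoid maps the logit to a break point in (0, 1).
        beta = 1.0 / (1.0 + math.exp(-logits[i]))
        weights[i] = beta * remaining  # this token's share of what is left
        remaining *= 1.0 - beta        # shrink the stick for older tokens

    return weights, remaining


if __name__ == "__main__":
    # Equal logits: each token takes half of whatever stick remains.
    w, r = stick_breaking_weights([0.0, 0.0, 0.0])
    print(w, r)  # [0.125, 0.25, 0.5] 0.125
```

With equal logits the most recent token receives the largest weight, which is the recency bias the abstract refers to. In this sketch the weights plus the leftover stick always sum to 1, so the weights alone can sum to less than 1.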
Peer Reviews
Decision·ICLR 2025 Poster
- The paper provides thorough and useful experimental results on the performance of stick-breaking attention, providing good exploration of its effectiveness for artificial and natural language tasks, for length generalization, etc. The experimentation seems well thought through and well done.
- The paper is generally clear and easy to read.
- The paper is very honest about what it contributes and what it uses from prior work.
- The paper provides useful and new empirical results on different for
- The paper lacks originality in machine learning ideas. Stick-breaking attention has been previously explored by Yikang Shen (in multiple papers) and especially by Csordas et al. (2021), the latter under the name "Geometric attention". "Stick-breaking attention" is a better name for the model used, but the model is exactly the same as in these prior works, limiting the originality of this paper. The value is mainly in the more extensive experimentation, including showing performance on larger s
- The paper presents a novel addition to the zoo of "attention alternatives" by leveraging the stick-breaking process, which performs both "attention" and positional embeddings intrinsically. The mechanism has a recency bias, meaning a token can prefer to allocate all its "energy" to a few recent tokens, but it can also skip over and only attend to far-away tokens.
- The paper also includes a detailed implementation in Triton for flash-attention style efficiency and speed-up optimization, which is hug
- The method could be explained more clearly with diagrams, and the formulation should be defined more thoroughly.
This softmax alternative is simple and induces a bias towards attending to recent positions in a very natural way. The experimental results all look strong.
As far as I can tell, stick-breaking attention is exactly the same as geometric attention (Csordas et al 2021), and stick-breaking was previously introduced by Shen et al (2023). Both papers are cited in the introduction, and the introduction concludes with an accurate list of the novel contributions of the paper. However,
- The paper's short title "Stick-Breaking Attention" may give the impression that this is the first paper about stick-breaking attention.
- The abstract does not mention prev
Taxonomy
Topics: Advanced Neural Network Applications · Topic Modeling · Multimodal Machine Learning Applications
Methods: Attention Is All You Need · Softmax
