Scaling Stick-Breaking Attention: An Efficient Implementation and In-depth Study

Shawn Tan; Songlin Yang; Aaron Courville; Rameswar Panda; Yikang Shen

arXiv:2410.17980 · cs.LG · May 21, 2025



TL;DR

This paper studies stick-breaking attention, together with an efficient implementation, as an alternative to softmax-based attention, and reports competitive performance and improved length generalization in large-scale models.

Contribution

The paper scales up stick-breaking attention, a mechanism explored in prior work, providing implementation details, an adaptation of Flash Attention, and evidence that it works as a drop-in replacement for softmax attention.

Findings

01. Performs competitively on length generalization tasks

02. Enables models trained on smaller contexts to generalize to larger contexts

03. Achieves perplexity improvements at longer sequence lengths

Abstract

The self-attention mechanism traditionally relies on the softmax operator, necessitating positional embeddings like RoPE, or position biases to account for token order. But current methods still face length generalisation challenges. We investigate an alternative attention mechanism based on the stick-breaking process in larger scale settings. The method works as follows: for each token before the current one, we determine a break point, which represents the proportion of the stick, the weight of the attention, to allocate to the current token. We repeat this on the remaining stick, until all tokens are allocated a weight, resulting in a sequence of attention weights. This process naturally incorporates recency bias, which has linguistic motivations for grammar parsing. We study the implications of replacing the conventional softmax-based attention mechanism with stick-breaking…
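The allocation procedure described in the abstract can be illustrated with a minimal sketch for a single query position. It makes several assumptions that the abstract does not state: break points are taken to be the sigmoid of a per-token score, the walk starts at the most recent token and moves back to the oldest, and the names `stick_breaking_weights` and `logits` are hypothetical. It is not the paper's Triton implementation.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def stick_breaking_weights(logits):
    """Attention weights for one query over its preceding tokens.

    `logits[i]` is a hypothetical score for preceding token i, ordered
    oldest first. Break points are sigmoid(logit), an assumption of this
    sketch. Returns (weights, remaining_stick).
    """
    weights = [0.0] * len(logits)
    remaining = 1.0  # length of stick not yet allocated
    # Walk from the most recent token back to the oldest one.
    for i in range(len(logits) - 1, -1, -1):
        beta = sigmoid(logits[i])      # break point in (0, 1)
        weights[i] = beta * remaining  # share of the remaining stick
        remaining *= 1.0 - beta        # what is left for older tokens
    return weights, remaining


if __name__ == "__main__":
    w, left = stick_breaking_weights([0.0, 0.0, 0.0])
    print(w, left)
```

With equal scores, more recent tokens receive larger weights, which is the recency bias the abstract refers to. Unlike softmax, the weights in this sketch need not sum to one: whatever stick is left after the oldest token stays unallocated.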

Peer Reviews

Decision · ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- The paper provides thorough and useful experimental results on the performance of stick-breaking attention, providing good exploration of its effectiveness for artificial and natural language tasks, for length generalization, etc. The experimentation seems well thought through and well done.
- The paper is generally clear and easy to read.
- The paper is very honest about what it contributes and what it uses from prior work.
- The paper provides useful and new empirical results on different for

Weaknesses

- The paper lacks originality in machine learning ideas. Stick-breaking attention has been previously explored by Yikang Shen (in multiple papers) and especially by Csordas et al. (2021), the latter under the name "Geometric attention". "Stick-breaking attention" is a better name for the model used, but the model is exactly the same as in these prior works, limiting the originality of this paper. The value is mainly in the more extensive experimentation, including showing performance on larger s

Reviewer 02 · Rating 8 · Confidence 5

Strengths

- The paper presents a novel addition to the zoo of "attention alternatives" by leveraging the stick-breaking process, which performs both "attention" and positional embeddings intrinsically. The mechanism has a recency bias, meaning a token can prefer to allocate all its "energy" to a few recent tokens, but it can also skip over them and attend only to far-away tokens.
- The paper also includes a detailed implementation in Triton for flash-attention style efficiency and speed-up optimization, which is hug

Weaknesses

- Method can be explained more clearly with diagrams, formulation should be defined more thoroughly.

Reviewer 03 · Rating 6 · Confidence 4

Strengths

This softmax alternative is simple and induces a bias towards attending to recent positions in a very natural way. The experimental results all look strong.

Weaknesses

As far as I can tell, stick-breaking attention is exactly the same as geometric attention (Csordas et al 2021), and stick-breaking was previously introduced by Shen et al (2023). Both papers are cited in the introduction, and the introduction concludes with an accurate list of the novel contributions of the paper. However,
- The paper's short title "Stick-Breaking Attention" may give the impression that this is the first paper about stick-breaking attention.
- The abstract does not mention prev


Taxonomy

Topics: Advanced Neural Network Applications · Topic Modeling · Multimodal Machine Learning Applications

Methods: Attention Is All You Need · Softmax