Rodimus*: Breaking the Accuracy-Efficiency Trade-Off with Efficient Attentions
Zhihao He, Hang Yu, Zi Gong, Shizhan Liu, Jianguo Li, Weiyao Lin

TL;DR
Rodimus and Rodimus+ introduce a novel, efficient attention mechanism for large language models that significantly reduces computational costs while maintaining or improving performance, challenging the traditional accuracy-efficiency trade-off.
Contribution
The paper presents Rodimus and Rodimus+ with innovative attention mechanisms that enhance efficiency and accuracy in LLMs, including a data-dependent tempered selection and hybrid attention techniques.
Findings
Rodimus+-1.6B outperforms larger models trained on more data.
Achieves superior downstream performance with reduced memory usage.
Open-sourced code and models for community use.
Abstract
Recent advancements in Transformer-based large language models (LLMs) have set new standards in natural language processing. However, the classical softmax attention incurs significant computational costs, leading to an O(T) complexity for per-token generation, where T represents the context length. This work explores reducing LLMs' complexity while maintaining performance by introducing Rodimus and its enhanced version, Rodimus+. Rodimus employs an innovative data-dependent tempered selection (DDTS) mechanism within a linear attention-based, purely recurrent framework, achieving significant accuracy while drastically reducing the memory usage typically associated with recurrent models. This method exemplifies semantic compression by maintaining essential input information with fixed-size hidden states. Building on this, Rodimus+ combines Rodimus with the innovative Sliding…
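To make the fixed-size hidden state concrete, here is a minimal sketch in plain Python of a generic gated linear-attention recurrence. It is not the paper's DDTS formulation: the `forget` and `inp` gate vectors are supplied by hand rather than computed from the input, and every function name is an illustrative assumption. It only shows that the recurrent state stays a d_k x d_v matrix however many tokens are processed, whereas a softmax key-value cache grows with the context length T.

```python
def gated_step(state, q, k, v, forget, inp):
    """One recurrent step of a generic gated linear attention.

    state  : d_k x d_v matrix (list of rows), the fixed-size memory
    q, k   : query and key vectors of length d_k
    v      : value vector of length d_v
    forget : per-key-dimension decay in [0, 1] (hand-set here, an assumption)
    inp    : per-key-dimension write strength (hand-set here, an assumption)
    """
    d_k, d_v = len(k), len(v)
    # Decay the old memory, then write the outer product of the gated key and the value.
    new_state = [
        [forget[i] * state[i][j] + inp[i] * k[i] * v[j] for j in range(d_v)]
        for i in range(d_k)
    ]
    # Read out: the query selects a mixture of the stored rows.
    output = [sum(q[i] * new_state[i][j] for i in range(d_k)) for j in range(d_v)]
    return new_state, output


def run_recurrent(tokens, d_k, d_v):
    """Process a sequence of (q, k, v, forget, inp) tuples one token at a time."""
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q, k, v, forget, inp in tokens:
        state, out = gated_step(state, q, k, v, forget, inp)
        outputs.append(out)
    return state, outputs


def kv_cache_entries(context_len, d_k, d_v):
    """Numbers stored by a softmax KV cache for one head: grows linearly with T."""
    return context_len * (d_k + d_v)


def recurrent_state_entries(d_k, d_v):
    """Numbers stored by the recurrent state: independent of T."""
    return d_k * d_v
```

With `d_k = d_v = 2`, the recurrent state always holds 4 numbers, while the cache for a 1000-token context holds 4000 under the same dimensions.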
Peer Reviews
Decision: ICLR 2025 Poster
1. The paper introduces two new techniques, data-dependent tempered selection (DDTS) and shared-key attention (SKA). DDTS is a new gating-based recurrent type for state-space models which enhances performance and increases parameter efficiency through reducing hidden state size. SKA shares a single key representation to reduce memory footprint while maintaining the multi-value setting of the original multi-head attention. 2. These techniques are evaluated extensively in general tasks (content a
1. While the main focus of the paper is to improve the trade-off in terms of accuracy and efficiency, the experiments do not discuss the efficiency aspects such as memory cost and latency or times for training/inference. 2. Even though the specific gates proposed haven't been used in prior work, the novelty in the design can be considered rather incremental as the formulation is very similar to previous ones (Mamba, S4D). 3. Direct comparisons with alternative gating functions in SSMs using th
The paper is generally well written and presented. The ideas appear thought through and are (somewhat) mathematically reasoned. A more reasonable interplay of gating mechanics, SSMs in general, and attention is pretty important for iterating towards more robust language models.
1) I understood that the change to Rodimus+ is the two norms, SW-SKA, and an FFN per layer. How can this sum up to such negligible changes in parameters in Tab. 2, or do you use entirely different architecture configs? 2) I do not find anything conclusive about runtime estimates (only a vague big-O estimate and a chunking approach for training). This should not be optional! 3) My major concern is that only Fig. 1 / Tab. 1 / Fig. 4 appear to be "somewhat fair" comparisons, but at a ridiculously low scale.
- This work comprehensively examines the design space of gated linear attention (GLA), particularly focusing on an outer-product-based gating structure with input and forget gates. It addresses a gap in systematic studies on GLA's gating mechanisms, positioning itself well to bridge this gap. The proposed Data-Dependent Tempered Selection (DDTS) mechanism is both intuitively appealing and practically useful, as demonstrated by experimental results. Additionally, the definition of the selection m
- The work appears somewhat incremental, as it largely builds upon existing concepts. Technical contribution is limited. - Regarding language modeling experiments, the choice of the WikiText-103 dataset for language modeling may not be ideal, as it is relatively simplistic and sensitive to hyperparameter tuning. For Figure 1a, larger-scale experiments using more extensive datasets would provide a more robust evaluation. Additionally, the controlled experiment scale in Table 2 is too limited; i
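The shared-key attention (SKA) idea mentioned in the first review, one key representation shared across heads while each head keeps its own values, can be sketched as follows. This is a rough illustration based only on that one-sentence description, not the authors' implementation: it handles a single query position, omits the sliding window and any learned projections, and all names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def shared_key_attention(queries, shared_keys, values_per_head):
    """Attention for one query position with keys shared by all heads.

    queries         : one query vector per head
    shared_keys     : T key vectors, the same for every head (the assumption here)
    values_per_head : for each head, its own list of T value vectors
    """
    d = len(shared_keys[0])
    outputs = []
    for q, values in zip(queries, values_per_head):
        # Every head scores against the same keys but with its own query.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in shared_keys
        ]
        weights = softmax(scores)
        d_v = len(values[0])
        # Each head mixes its own values, so the multi-value setting is kept.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)]
        )
    return outputs


def cache_entries(context_len, heads, d_k, d_v, shared_key):
    """Numbers cached for keys and values; keys are stored once if shared."""
    key_copies = 1 if shared_key else heads
    return context_len * d_k * key_copies + context_len * heads * d_v
```

Under these assumptions, with 4 heads and `d_k = d_v = 8` over 100 tokens, sharing the key reduces the cached entries from 6400 to 4000, while each head still produces its own output.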
Taxonomy
Topics: Computability, Logic, AI Algorithms
Methods: Attention Is All You Need · Sparse Evolutionary Training · Softmax
