Rodimus*: Breaking the Accuracy-Efficiency Trade-Off with Efficient Attentions

Zhihao He; Hang Yu; Zi Gong; Shizhan Liu; Jianguo Li; Weiyao Lin

arXiv:2410.06577·cs.CL·May 20, 2025

TL;DR

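Rodimus and Rodimus+ replace classical softmax attention, whose per-token generation cost grows as $O(T)$ with context length $T$, with more efficient attention mechanisms. Rodimus is a purely recurrent, linear-attention model with fixed-size hidden states, and Rodimus+ is a hybrid that adds attention on top of it. The goal is to cut compute and memory while maintaining or improving accuracy, challenging the usual accuracy-efficiency trade-off.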

Contribution

The paper presents Rodimus, built on a data-dependent tempered selection (DDTS) mechanism, and Rodimus+, which adds hybrid attention techniques, with the aim of improving both efficiency and accuracy in LLMs.

Findings

01. Rodimus+-1.6B outperforms larger models trained on more data.
02. Achieves superior downstream performance with reduced memory usage.
03. Open-sourced code and models for community use.

Abstract

Recent advancements in Transformer-based large language models (LLMs) have set new standards in natural language processing. However, the classical softmax attention incurs significant computational costs, leading to an $O(T)$ complexity for per-token generation, where $T$ represents the context length. This work explores reducing LLMs' complexity while maintaining performance by introducing Rodimus and its enhanced version, Rodimus+. Rodimus employs an innovative data-dependent tempered selection (DDTS) mechanism within a linear attention-based, purely recurrent framework, achieving significant accuracy while drastically reducing the memory usage typically associated with recurrent models. This method exemplifies semantic compression by maintaining essential input information with fixed-size hidden states. Building on this, Rodimus+ combines Rodimus with the innovative Sliding…
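To make the complexity claim in the abstract concrete, the following is a minimal NumPy sketch that contrasts one decoding step of softmax attention over a growing key/value cache with one step of a plain, ungated linear-attention recurrence over a fixed-size state. It is an illustration under stated assumptions, not the Rodimus model: the function names (`softmax_decode_step`, `linear_decode_step`, `compare_decoding`) and the toy dimensions are hypothetical, and the DDTS gating is deliberately omitted.

```python
import numpy as np

def softmax_decode_step(q, K_cache, V_cache):
    """One softmax-attention decoding step over T cached tokens.
    Work and memory both scale with T, the number of cached rows."""
    scores = K_cache @ q / np.sqrt(q.shape[0])      # shape (T,)
    weights = np.exp(scores - scores.max())         # numerically stable softmax
    weights /= weights.sum()
    return weights @ V_cache                        # shape (d_v,)

def linear_decode_step(q, k, v, S):
    """One step of ungated linear attention (illustrative, not DDTS).
    S has fixed shape (d_k, d_v), so the per-token cost does not depend on T."""
    S = S + np.outer(k, v)                          # fold the new token into the state
    return q @ S, S

def compare_decoding(T=64, d_k=8, d_v=8, seed=0):
    """Decode T random tokens both ways and report how much each one stores."""
    rng = np.random.default_rng(seed)
    K_cache = np.empty((0, d_k))
    V_cache = np.empty((0, d_v))
    S = np.zeros((d_k, d_v))
    for _ in range(T):
        q = rng.normal(size=d_k)
        k = rng.normal(size=d_k)
        v = rng.normal(size=d_v)
        K_cache = np.vstack([K_cache, k])           # cache grows by one row per token
        V_cache = np.vstack([V_cache, v])
        y_soft = softmax_decode_step(q, K_cache, V_cache)
        y_lin, S = linear_decode_step(q, k, v, S)   # state keeps the same shape
    return {
        "cache_floats": K_cache.size + V_cache.size,  # T * (d_k + d_v)
        "state_floats": S.size,                       # d_k * d_v
        "y_soft": y_soft,
        "y_lin": y_lin,
    }
```

With the default toy sizes, the cache holds 64 × (8 + 8) = 1024 floats after 64 tokens, while the recurrent state stays at 8 × 8 = 64 floats regardless of how many tokens have been processed.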

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. The paper introduces two new techniques, data-dependent tempered selection (DDTS) and shared-key attention (SKA). DDTS is a new gating-based recurrence for state-space models that improves performance and parameter efficiency by reducing the hidden state size. SKA shares a single key representation to reduce the memory footprint while keeping the multi-value setting of the original multi-head attention.
2. These techniques are evaluated extensively in general tasks (content a
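The shared-key idea described by the reviewer can be sketched as follows. This is a hedged illustration based only on that description (one key projection shared by all heads, separate values per head), not the paper's exact SKA formulation: the function name `shared_key_attention`, the per-head query projections, the causal mask, and all shapes are assumptions made for the example, and the sliding-window restriction of SW-SKA is left out.

```python
import numpy as np

def shared_key_attention(X, Wq, Wk, Wv):
    """Causal attention where all heads share one key projection (illustrative).

    X:  (T, d_model)          input tokens
    Wq: (H, d_model, d_head)  per-head query projections (assumed)
    Wk: (d_model, d_head)     single key projection shared by every head
    Wv: (H, d_model, d_head)  per-head value projections
    Returns the concatenated head outputs (T, H * d_head) and the shared keys.
    """
    T = X.shape[0]
    K = X @ Wk                                      # (T, d_head), computed once
    Q = np.einsum("td,hde->hte", X, Wq)             # (H, T, d_head)
    V = np.einsum("td,hde->hte", X, Wv)             # (H, T, d_head)

    scores = Q @ K.T / np.sqrt(K.shape[1])          # (H, T, T), same K for all heads
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)      # mask out future positions
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)

    out = weights @ V                               # (H, T, d_head)
    return out.transpose(1, 0, 2).reshape(T, -1), K
```

Under these assumptions, a decoding cache would store one key vector plus `H` value vectors per token, that is `(1 + H) * d_head` floats, compared with `2 * H * d_head` when every head keeps its own key.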

Weaknesses

1. While the main focus of the paper is to improve the trade-off between accuracy and efficiency, the experiments do not discuss efficiency aspects such as memory cost, latency, or training/inference time.
2. Even though the specific gates proposed haven't been used in prior work, the novelty in the design can be considered rather incremental, as the formulation is very similar to previous ones (Mamba, S4D).
3. Direct comparisons with alternative gating functions in SSMs using th

Reviewer 02 · Rating 6 · Confidence 4

Strengths

The paper is generally well written and presented. The ideas appear thought-through and are (somewhat) mathematically reasoned. A more reasonable interplay between gating mechanics (and SSMs in general) and attention is pretty important for iterating towards more robust language models.

Weaknesses

1. I understood the change to Rodimus+ to be the two norms, SW-SKA, and an FFN per layer. How can this add up to the negligible parameter changes in Tab 2, or do you use entirely different architecture configs?
2. I do not find anything conclusive about runtime estimates (only a vague big-O estimate and some chunking approach for training). This should not be optional!
3. My major concern is that only Fig 1 / Tab 1 / Fig 4 appear to be "somewhat fair" comparisons, but at a ridiculously low scale.

Reviewer 03 · Rating 6 · Confidence 5

Strengths

- This work comprehensively examines the design space of gated linear attention (GLA), particularly focusing on an outer-product-based gating structure with input and forget gates. It addresses a gap in systematic studies on GLA's gating mechanisms, positioning itself well to bridge this gap. The proposed Data-Dependent Tempered Selection (DDTS) mechanism is both intuitively appealing and practically useful, as demonstrated by experimental results. Additionally, the definition of the selection m
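For readers unfamiliar with the design space the reviewer refers to, here is a minimal sketch of a linear-attention recurrence whose state is decayed by an outer-product forget gate and updated through an input gate. It is a generic illustration under assumed notation, not the DDTS parametrization from the paper: the function name `gated_recurrence` and the gate arrays `A`, `B`, and `G` are hypothetical, and how the gates are computed from the input is left to the caller.

```python
import numpy as np

def gated_recurrence(Q, K, V, A, B, G):
    """Gated linear attention with an outer-product forget gate (illustrative).

    Q, K: (T, d_k)   queries and keys
    V:    (T, d_v)   values
    A:    (T, d_k)   forget-gate factors along the key dimension, in [0, 1]
    B:    (T, d_v)   forget-gate factors along the value dimension, in [0, 1]
    G:    (T, d_k)   input-gate factors applied to the key
    The state S keeps the fixed shape (d_k, d_v) for every t.
    """
    T, d_k = K.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))
    out = np.empty((T, d_v))
    for t in range(T):
        forget = np.outer(A[t], B[t])                # (d_k, d_v) elementwise decay
        S = forget * S + np.outer(G[t] * K[t], V[t]) # decay old state, write new token
        out[t] = Q[t] @ S                            # read out with the query
    return out, S
```

Setting all gates to one recovers ungated linear attention, and setting the forget gate to zero makes each output depend on the current token only, which gives two simple sanity checks for any specific gate design.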

Weaknesses

- The work appears somewhat incremental, as it largely builds upon existing concepts. The technical contribution is limited.
- Regarding the language modeling experiments, the choice of the WikiText-103 dataset may not be ideal, as it is relatively simplistic and sensitive to hyperparameter tuning. For Figure 1a, larger-scale experiments using more extensive datasets would provide a more robust evaluation. Additionally, the controlled experiment scale in Table 2 is too limited; i

Taxonomy

Topics: Computability, Logic, AI Algorithms

Methods: Attention Is All You Need · Sparse Evolutionary Training · Softmax