IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs
Yuzhen Mao, Martin Ester, Ke Li

TL;DR
IceFormer is an inference-time acceleration method for long-sequence Transformers on CPUs. It achieves 2.73x to 7.63x speedups while retaining 98.6% to 99.6% of the original model's accuracy, without retraining.
Contribution
The paper presents a new method to accelerate self-attention in pretrained Transformers for long sequences on CPUs without retraining.
Findings
Achieves 2.73x to 7.63x speedup in inference.
Retains 98.6% to 99.6% of original model accuracy.
Applicable to various long-sequence Transformer models.
Abstract
One limitation of existing Transformer-based models is that they cannot handle very long sequences as input since their self-attention operations exhibit quadratic time and space complexity. This problem becomes especially acute when Transformers are deployed on hardware platforms equipped only with CPUs. To address this issue, we propose a novel method for accelerating self-attention at inference time that works with pretrained Transformer models out-of-the-box without requiring retraining. We experiment using our method to accelerate various long-sequence Transformers, including a leading LLaMA 2-based LLM, on various benchmarks and demonstrate a greater speedup of 2.73x - 7.63x while retaining 98.6% - 99.6% of the accuracy of the original pretrained models. The code is available on our project website at https://yuzhenmao.github.io/IceFormer/.
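To make the quadratic cost concrete, the sketch below compares standard self-attention with a sparse variant in which each query attends only to its k highest-scoring keys, the top-k selection discussed in the reviews below. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and a brute-force top-k selection stands in for a k-nearest-neighbour search index. Because the brute-force version still computes every query-key score, it shows only the approximation, not the speedup.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def full_attention(Q, K, V):
    # Standard attention: the (n, m) score matrix is the quadratic cost.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    return softmax(scores, axis=1) @ V

def topk_attention(Q, K, V, k):
    # Hypothetical sparse variant: each query keeps only its k best keys.
    # Assumption: brute-force selection replaces a k-NNS index here.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    idx = np.argpartition(scores, -k, axis=1)[:, -k:]      # (n, k) key indices
    top_scores = np.take_along_axis(scores, idx, axis=1)   # (n, k) kept scores
    weights = softmax(top_scores, axis=1)                  # renormalise over kept keys
    return np.einsum("nk,nkd->nd", weights, V[idx])        # weighted sum of kept values

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, m, d = 8, 256, 16
    Q = rng.standard_normal((n, d))
    K = rng.standard_normal((m, d))
    V = rng.standard_normal((m, d))
    exact = full_attention(Q, K, V)
    approx = topk_attention(Q, K, V, k=32)
    print("max abs error with k=32:", np.abs(exact - approx).max())
```

With k equal to the number of keys, the sparse variant reduces to full attention. Smaller k trades accuracy for less work, provided the top-k keys can be found without scoring every key.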
Peer Reviews
Decision: ICLR 2024 poster
1. The paper offers new insights into an important problem: choosing the top-k keys in sparse attention for inference acceleration. The authors show that the exact k-NNS algorithm is critical for the success of sparse attention. 2. The paper is well-written and easy to follow. 3. The evaluation section shows strong performance compared to other efficient attention works.
1. The use of the Prioritized DCI k-NNS algorithm needs more experimental or theoretical justification. The authors claim "ranking-based algorithms is better aligned with how attention weights"; if so, how would other ranking-based algorithms perform? On top of that, the authors evaluate different k-NNS algorithms on the fashion-mnist-784 dataset in Section 5.1. It would be better to show the exact setup (e.g., model architecture) they used, and to compare them on a few more tasks (for instance,
1. The paper is well-motivated, tackling a pertinent issue in the deployment of large language models. 2. The idea of employing ranking-based algorithms over bucketing-based algorithms presents an interesting potential for complexity reduction.
1. The paper does not adequately support its claim that Prioritized DCI outperforms LSH, lacking both theoretical and empirical evidence. 2. There is insufficient clarity in the algorithm's implementation details, making it difficult to understand the actual complexity and the mechanics of the proposed method. 3. The evaluation methodology for measuring inference time is not comprehensive. It appears the method is optimized for CPUs but lacks evidence of similar efficacy on GPUs.
- This method can accelerate inference (CPU only) for pretrained Transformers without the need for expensive and time-consuming retraining.
- Unlike some other methods, IceFormer ensures that there is minimal approximation error, which is crucial for LLMs, where errors in initial layers can cascade through subsequent ones.
- Beyond just accuracy, the method also guarantees rapid inference times, making it particularly suitable for CPUs.
- By capitalizing on the sparsity of the attention matrix and ut
This paper has multiple concerns for acceptance at ICLR 2024: 1) The most glaring issue is its reliance on outdated methods from 2020 and 2021, with some even referencing the 2017 vanilla transformer. This dated focus suggests a lack of recent advancements in the field. One must ponder why the topic of 'efficient transformers' isn't garnering contemporary attention. Historically, efforts to enhance transformer efficiency via attention layer optimization waned with the introduction of Large Lang
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Scientific Computing and Data Management · Computational Physics and Python Applications
Methods: Attention Is All You Need · Dense Connections · Dropout · Label Smoothing · Residual Connection · Softmax · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Absolute Position Encodings · Linear Layer
