IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs

Yuzhen Mao; Martin Ester; Ke Li

arXiv:2405.02842 · cs.LG · May 7, 2024

TL;DR

IceFormer is an inference-time acceleration method for long-sequence Transformers on CPUs. It reports speedups of 2.73x to 7.63x while retaining 98.6% to 99.6% of the original models' accuracy, without retraining.

Contribution

The paper presents a new method to accelerate self-attention in pretrained Transformers for long sequences on CPUs without retraining.

Findings

1. Achieves 2.73x to 7.63x speedup in inference.
2. Retains 98.6% to 99.6% of original model accuracy.
3. Applicable to various long-sequence Transformer models.

Abstract

One limitation of existing Transformer-based models is that they cannot handle very long sequences as input since their self-attention operations exhibit quadratic time and space complexity. This problem becomes especially acute when Transformers are deployed on hardware platforms equipped only with CPUs. To address this issue, we propose a novel method for accelerating self-attention at inference time that works with pretrained Transformer models out-of-the-box without requiring retraining. We experiment using our method to accelerate various long-sequence Transformers, including a leading LLaMA 2-based LLM, on various benchmarks and demonstrate a greater speedup of 2.73x - 7.63x while retaining 98.6% - 99.6% of the accuracy of the original pretrained models. The code is available on our project website at https://yuzhenmao.github.io/IceFormer/.
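
The idea of replacing full attention with attention over a small set of top-scoring keys can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `full_attention` and `topk_attention` are hypothetical, and the key selection uses an exact brute-force sort where the reviews below indicate that IceFormer uses a k-nearest-neighbor search algorithm (Prioritized DCI). The brute-force version therefore shows the approximation only and gives no speedup.

```python
import math
import random


def full_attention(query, keys, values):
    """Softmax attention for one query over all n keys.

    Doing this for every query costs O(n^2) time overall.
    """
    scores = [sum(q * x for q, x in zip(query, key)) for key in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def topk_attention(query, keys, values, k):
    """Softmax attention restricted to the k highest-scoring keys.

    Assumption: exact brute-force selection stands in for the
    k-nearest-neighbor search a real system would use.
    """
    scores = [sum(q * x for q, x in zip(query, key)) for key in keys]
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    m = max(scores[i] for i in top)
    weights = [math.exp(scores[i] - m) for i in top]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, top)) / total
            for d in range(dim)]


if __name__ == "__main__":
    random.seed(0)
    n, d_k, d_v = 64, 8, 4  # illustrative sizes, not taken from the paper
    keys = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(n)]
    values = [[random.gauss(0, 1) for _ in range(d_v)] for _ in range(n)]
    query = [random.gauss(0, 1) for _ in range(d_k)]

    exact = full_attention(query, keys, values)
    approx = topk_attention(query, keys, values, k=8)
    # Maximum absolute difference between full and top-k outputs.
    print(max(abs(a - b) for a, b in zip(exact, approx)))
```

When attention weights are concentrated on a few keys, the top-k output stays close to the full output; with `k` equal to the number of keys the two coincide up to floating-point rounding.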

Peer Reviews

Decision: ICLR 2024 poster

Reviewer 01 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

1. The paper sheds new insights on an important problem -- choosing the top-k keys in sparse attention for inference acceleration. The authors show that the exact kNNS algorithm is critical for the success of sparse attention.
2. The paper is well-written and easy to follow.
3. The evaluation section shows strong performance compared to other efficient attention works.

Weaknesses

1. The use of the Prioritized DCI k-NNS algorithm needs more experimental or theoretical justification. The authors claim "ranking-based algorithms is better aligned with how attention weights"; if so, how would other ranking-based algorithms perform? On top of that, the authors show an evaluation of different kNNS algorithms on the fashion-mnist-784 dataset in Section 5.1. It would be better to show the exact setup (e.g., model architecture) they used, and compare them on a few more tasks (for instance,

Reviewer 02 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

1. The paper is well-motivated, tackling a pertinent issue in the deployment of large language models.
2. The idea of employing ranking-based algorithms over bucketing-based algorithms presents an interesting potential for complexity reduction.

Weaknesses

1. The paper does not adequately support its claim that Prioritized DCI outperforms LSH, lacking both theoretical and empirical evidence.
2. There is insufficient clarity in the algorithm's implementation details, making it difficult to understand the actual complexity and the mechanics of the proposed method.
3. The evaluation methodology for measuring inference time is not comprehensive. It appears the method is optimized for CPUs but lacks evidence of similar efficacy on GPUs.

Reviewer 03 · Rating 3 (reject, not good enough) · Confidence 3

Strengths

- This method can accelerate inference time (only CPU) for pretrained transformers without the need for expensive and time-consuming retraining.
- Unlike some other methods, IceFormer ensures that there is minimal approximation error, crucial for LLMs where errors in initial layers can cascade through subsequent ones.
- Beyond just accuracy, the method also guarantees rapid inference times, making it particularly suitable for CPUs.
- By capitalizing on the sparsity of the attention matrix and ut

Weaknesses

This paper has multiple concerns for acceptance at ICLR 2024: 1) The most glaring issue is its reliance on outdated methods from 2020 and 2021, with some even referencing the 2017 vanilla transformer. This dated focus suggests a lack of recent advancements in the field. One must ponder why the topic of 'efficient transformers' isn't garnering contemporary attention. Historically, efforts to enhance transformer efficiency via attention layer optimization waned with the introduction of Large Lang

Videos

IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs · SlidesLive

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Scientific Computing and Data Management · Computational Physics and Python Applications

Methods: Attention Is All You Need · Dense Connections · Dropout · Label Smoothing · Residual Connection · Softmax · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Absolute Position Encodings · Linear Layer