FlashEVA: Accelerating LLM inference via Efficient Attention
Juan Gabriel Kostelec, Qinghai Guo

TL;DR
FlashEVA introduces an efficient attention mechanism for Transformers, significantly reducing memory usage and increasing inference throughput while maintaining performance across multiple tasks.
Contribution
The paper presents FlashEVA, a novel implementation of EVA attention that enables faster and more memory-efficient Transformer inference with minimal fine-tuning data.
Findings
Up to 6.7x higher throughput during inference
5x reduction in peak GPU memory usage
Effective across various downstream tasks
Abstract
Transformer models have revolutionized natural language processing, achieving state-of-the-art performance and demonstrating remarkable scalability. However, their memory demands, particularly due to maintaining full context in memory, pose significant challenges for inference. In this paper, we present FlashEVA, an efficient implementation of EVA (Efficient Attention via Control Variates), and demonstrate how to fine-tune Transformers to adapt to FlashEVA attention. Our method enables fine-tuning of Transformer models with as few as 1.5B tokens while preserving effectiveness across various downstream tasks. Notably, FlashEVA achieves up to 6.7x higher throughput and 5x lower peak GPU memory usage during inference compared to standard Transformer implementations. Despite these improvements, we observe limitations in retrieval-focused tasks. Our implementation offers control over the…
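The memory pressure the abstract attributes to "maintaining full context in memory" can be made concrete with a little arithmetic. The sketch below compares a standard key/value cache, which grows linearly with sequence length, against a bounded cache that keeps a local window of exact tokens plus one summary slot per chunk of older tokens. The bounded layout, the function names, and all model dimensions are illustrative assumptions, not values or an API taken from the paper.

```python
import math


def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Bytes needed to cache keys and values for every token in the context.

    The leading factor of 2 accounts for storing both K and V.
    bytes_per_elem=2 corresponds to fp16/bf16 storage.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


def bounded_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                        window, chunk_size, batch_size=1, bytes_per_elem=2):
    """Bytes for a hypothetical bounded cache (an assumption for illustration).

    Keeps the most recent `window` tokens exactly and replaces each chunk of
    `chunk_size` older tokens with a single summary slot.
    """
    local = min(seq_len, window)
    remote_tokens = max(seq_len - window, 0)
    summaries = math.ceil(remote_tokens / chunk_size)
    return kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                          local + summaries, batch_size, bytes_per_elem)


if __name__ == "__main__":
    # Hypothetical model configuration, chosen only to make the numbers readable.
    cfg = dict(num_layers=24, num_kv_heads=16, head_dim=64)
    for seq_len in (1024, 4096, 16384):
        full = kv_cache_bytes(seq_len=seq_len, **cfg)
        bounded = bounded_cache_bytes(seq_len=seq_len, window=256,
                                      chunk_size=16, **cfg)
        print(f"seq_len={seq_len:6d}  full={full / 2**20:8.1f} MiB  "
              f"bounded={bounded / 2**20:8.1f} MiB")
```

The full cache doubles whenever the context doubles, while the bounded variant grows by only one slot per chunk of older tokens. This is the kind of trade-off between memory and fidelity to the full context that the reported limitations on retrieval-focused tasks would be consistent with.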
Peer Reviews
Decision: Submitted to ICLR 2025
+ The paper is well organized and presented.
+ Experimental results show that the adjustable hyperparameters allow for a trade-off between throughput, accuracy, and memory usage.
+ FlashEVA is compatible with existing optimized attention implementations and can leverage CUDA kernels for performance optimization.
- The proposed method is overly simplistic and unimpressive. It looks like an implementation of FlashAttention with EVA, which has already been proposed.
- The experimental results are not persuasive, since they do not show an advantage over DiJiang and sliding-window attention in Figure 1. Instead, the method is only a trade-off between DiJiang and sliding-window attention.
This paper is well-organized and clear, effectively presenting complex ideas in efficient attention mechanisms. Background and motivation are well-integrated, and the experimental results are systematically laid out, with tables and figures that clarify performance gains. The discussion of trade-offs and limitations shows a balanced approach, enhancing the paper’s readability and impact.
A primary limitation of this paper is its lack of significant novelty beyond the existing EVA framework. While FlashEVA offers efficiency gains, these improvements are largely a result of optimizing existing CUDA/Triton kernels rather than introducing new concepts. As such, the contribution may appear incremental, particularly given the relatively modest improvements in throughput and memory efficiency. While the paper briefly compares FlashEVA with the Mamba model, it does not thoroughly exami
- The motivation and the background on different forms of attention are clear.
- Extensive experimental results, including different downstream tasks.
- The proposed method achieves clearly better throughput and memory consumption compared with EVA and FlashAttention.
- The proposed method can be fine-tuned from a standard attention model, which makes it easier to use.
- The contribution of this work is incremental, building on EVA. It is not a new algorithm but an efficient implementation of EVA.
- Although the background on RFA and EVA is clearly explained, some background on FlashAttention should be included, since it is more closely related and the reader may not be familiar with it.
- In addition, more details should be given about the CUDA implementation, such as pseudocode and how the custom attention mask is achieved. Current presentation about the flashEVA is to
Taxonomy
Topics: Advanced Neural Network Applications · Natural Language Processing Techniques · Topic Modeling
