UniAttn: Reducing Inference Costs via Softmax Unification for Post-Training LLMs
Yizhe Xiong, Wei Huang, Xin Ye, Hui Chen, Zijia Lin, Haoran Lian, Zhenpeng Su, Jungong Han, Guiguang Ding

TL;DR
UniAttn introduces a novel post-training method that unifies Softmax activations across transformer blocks, significantly reducing inference costs while maintaining model performance.
Contribution
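The paper proposes Softmax Unification in Attention (UniAttn), a post-training approach that unifies Softmax activations across transformer blocks to lower LLM inference costs, outperforming existing efficient architectures such as intra-layer and cross-layer KV sharing.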
Findings
Significant reduction in inference latency with UniAttn.
Maintains comparable performance to standard post-training.
Outperforms existing efficient architectures during post-training.
Abstract
Post-training is essential for adapting Large Language Models (LLMs) to real-world applications. Deploying post-trained models faces significant challenges due to substantial memory overhead and noticeable inference latency. Existing work has identified significant redundancies in LLMs and proposed efficient architectures, namely intra-layer KV sharing and cross-layer KV sharing. However, these methods still result in high inference time overhead, remaining suboptimal for post-training pre-trained LLMs. In this paper, we identify that the Softmax operation is a primary bottleneck for LLM inference and discover that it is actually highly redundant during post-training. We propose Softmax Unification in Attention (UniAttn), a novel post-training method that unifies Softmax activations across transformer blocks to reduce LLM inference costs.…
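The abstract is truncated before it gives implementation details, so the following is only an illustrative sketch of what unifying Softmax activations across blocks could look like, not the authors' method. It assumes that blocks are partitioned into consecutive groups, that the first block in each group computes the Softmax attention probabilities, and that the remaining blocks in the group reuse them. The names `run_blocks`, `attention_probs`, and `group_size`, as well as the simplified single-head, residual-only block, are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_probs(x, w_q, w_k):
    # Causal single-head attention probabilities for a (seq_len, d) input.
    q, k = x @ w_q, x @ w_k
    scores = q @ k.T / np.sqrt(q.shape[-1])
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores)

def run_blocks(x, blocks, group_size):
    # ASSUMPTION: only the first block of each group runs Softmax;
    # the other blocks in the group reuse its attention probabilities.
    softmax_calls = 0
    probs = None
    for i, blk in enumerate(blocks):
        if i % group_size == 0:
            probs = attention_probs(x, blk["w_q"], blk["w_k"])
            softmax_calls += 1
        # Each block still applies its own value projection (residual update).
        x = x + probs @ (x @ blk["w_v"])
    return x, softmax_calls
```

Under these assumptions, `group_size=1` corresponds to ordinary attention with one Softmax per block, while larger values skip the query/key projections and the Softmax in the reusing blocks, which is where the cost savings in this sketch would come from.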
Taxonomy
Topics: Scientific Computing and Data Management · Reservoir Engineering and Simulation Methods · Simulation Techniques and Applications
