NeuronMLP: Efficient LLM Inference via Singular Value Decomposition Compression and Tiling on AWS Trainium
Dinghong Song, Jierui Xu, Weichu Yang, Pengfei Su, Dong Li

TL;DR
NeuronMLP is an inference method for large language models on AWS Trainium that uses SVD compression and tiling to speed up inference by reducing data movement and optimizing kernel execution.
Contribution
It introduces Trainium-specific techniques, including kernel fusion and caching strategies, for efficient MLP inference in LLMs with SVD-based compression.
Findings
Achieves a 1.35x kernel-level speedup over AWS's NKI kernel.
Attains a 1.21x end-to-end inference speedup on multiple LLMs.
Operates effectively at a compression ratio of 0.05.
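To make the idea of SVD compression concrete, here is a minimal NumPy sketch of replacing one weight matrix with two low-rank factors. It is not the paper's implementation. The helper name `svd_compress`, the matrix sizes, and the reading of "compression ratio" as compressed parameter count divided by original parameter count are all assumptions, since the summary does not define the ratio.

```python
import numpy as np

def svd_compress(W, ratio):
    """Factor W (d_in x d_out) into A (d_in x r) and B (r x d_out).

    Assumption: `ratio` is (parameters in A and B) / (parameters in W).
    """
    d_in, d_out = W.shape
    # Largest rank r such that r * (d_in + d_out) <= ratio * d_in * d_out.
    r = max(1, int(ratio * d_in * d_out / (d_in + d_out)))
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :r] * S[:r]   # fold the top-r singular values into the left factor
    B = Vt[:r, :]
    return A, B

# Hypothetical sizes chosen only for illustration.
rng = np.random.default_rng(0)
W = rng.standard_normal((512, 2048)).astype(np.float32)
x = rng.standard_normal((4, 512)).astype(np.float32)

A, B = svd_compress(W, ratio=0.05)
# One large matmul x @ W becomes two thin matmuls through the rank-r bottleneck.
y_lowrank = (x @ A) @ B
compressed_fraction = (A.size + B.size) / W.size
```

Under this reading, a 512 x 2048 matrix at ratio 0.05 is stored as a 512 x 20 and a 20 x 2048 factor. A random matrix like the one above is not well approximated at such a low rank; the sketch shows only the shapes and the data flow, not the accuracy a trained model would retain.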
Abstract
Emerging AI accelerators have started to gain attention and offer new opportunities for efficient inference of large language models (LLMs). Trainium, an AI accelerator recently developed by Amazon Web Services (AWS), provides an attractive option for LLM inference through its heterogeneous architecture. However, leveraging the Trainium architecture for high performance can be challenging because of its systolic array architecture and special requirements on data layout. In this paper, we propose NeuronMLP, an efficient LLM inference method based on Singular Value Decomposition (SVD) compression and tiling on AWS Trainium. We introduce a series of techniques customized to Trainium based on kernel fusion and novel caching strategies to reduce data movement across the software-managed memory hierarchy, maximize SRAM bandwidth, and avoid expensive matrix transposes. The proposed method is…
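The tiling mentioned in the abstract can be illustrated with a generic blocked matrix multiply. The sketch below is plain NumPy on the host, not an NKI kernel and not the authors' code. The function name `tiled_matmul` and the default tile sizes are illustrative assumptions; the point is only that each output tile is accumulated from small blocks that could fit in fast on-chip memory.

```python
import numpy as np

def tiled_matmul(X, W, tm=128, tk=128, tn=512):
    """Compute X @ W block by block (tile sizes are illustrative assumptions)."""
    M, K = X.shape
    K2, N = W.shape
    assert K == K2, "inner dimensions must match"
    dtype = np.result_type(X, W)
    out = np.zeros((M, N), dtype=dtype)
    for m0 in range(0, M, tm):
        for n0 in range(0, N, tn):
            # Accumulator for one output tile; edge tiles may be smaller.
            acc = np.zeros((min(tm, M - m0), min(tn, N - n0)), dtype=dtype)
            for k0 in range(0, K, tk):
                # Slicing past the end of an array is clipped by NumPy.
                acc += X[m0:m0 + tm, k0:k0 + tk] @ W[k0:k0 + tk, n0:n0 + tn]
            out[m0:m0 + tm, n0:n0 + tn] = acc
    return out

# Hypothetical shapes that are not multiples of the tile sizes.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 300)).astype(np.float32)
W = rng.standard_normal((300, 700)).astype(np.float32)
Y = tiled_matmul(X, W)
```

On an accelerator with a software-managed memory hierarchy, the loop order and tile shapes determine how often each block is reloaded, which is the kind of data movement the abstract says the method aims to reduce.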
