TL;DR
TurboMind, the mixed-precision inference engine of LMDeploy, accelerates large language model inference through two hardware-aware pipelines (one for GEMM, one for attention), achieving up to 61% lower latency and 156% higher throughput across 16 LLMs and 4 GPU architectures.
Contribution
Introduces TurboMind, a hardware-aware, generalizable mixed-precision inference engine that optimizes matrix and attention operations for diverse hardware and precision formats.
Findings
Achieves up to 61% lower latency and up to 156% higher throughput in mixed-precision LLM inference.
Demonstrates consistent performance improvements across 16 LLMs and 4 GPU architectures.
Open-sourced at https://github.com/InternLM/lmdeploy.
Abstract
Mixed-precision inference techniques reduce the memory and computational demands of Large Language Models (LLMs) by applying hybrid precision formats to model weights, activations, and KV caches. However, existing systems struggle to (i) automatically generalize across diverse hardware architectures and precision formats, often requiring fragmented, hand-tuned kernels, and (ii) fully exploit available memory and compute resources, often causing performance bottlenecks. To address these problems, we propose TurboMind, a generalizable and efficient mixed-precision LLM inference engine of LMDeploy. TurboMind is built around two hardware-aware mixed-precision pipelines: a General Matrix Multiply (GEMM) pipeline that optimizes matrix operations through offline weight packing and online acceleration, and an attention pipeline that enables efficient attention computation with different Query,…
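The abstract describes the GEMM pipeline as pairing offline weight packing with online acceleration. The sketch below illustrates that general idea in plain Python and is not TurboMind's implementation: weights are quantized to 4 bits with one scale per group and packed two per byte ahead of time, then unpacked and dequantized on the fly during a dot product. The function names, the symmetric quantization scheme, the group size, and the nibble layout are all assumptions made for illustration; real kernels perform these steps on the GPU with hardware-specific layouts.

```python
# Illustrative sketch only; not TurboMind's kernels or data layout.
# All function names and the nibble layout below are hypothetical.

def quantize_row(weights, group_size=4):
    """Offline: symmetric 4-bit quantization with one float scale per group."""
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        scales.append(scale)
        for w in group:
            q = max(-8, min(7, round(w / scale)))  # clamp to signed 4-bit range
            codes.append(q + 8)                    # shift to unsigned [0, 15]
    return codes, scales

def pack_int4(codes):
    """Offline: store two 4-bit codes per byte, low nibble first."""
    if len(codes) % 2:
        codes = codes + [8]  # pad with the code that represents zero
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))

def unpack_int4(packed, count):
    """Online: recover the first `count` 4-bit codes from packed bytes."""
    out = []
    for byte in packed:
        out.append(byte & 0x0F)
        out.append(byte >> 4)
    return out[:count]

def mixed_dot(packed, scales, activations, group_size=4):
    """Online: dequantize each weight on the fly and accumulate in float."""
    codes = unpack_int4(packed, len(activations))
    total = 0.0
    for i, (code, x) in enumerate(zip(codes, activations)):
        total += (code - 8) * scales[i // group_size] * x
    return total

# Usage: pack once offline, then reuse the packed weights for every request.
weights = [0.7, -0.35, 0.0, 0.1, 1.4, -1.4, 0.2, 0.6]
activations = [1.0] * len(weights)
codes, scales = quantize_row(weights)
packed = pack_int4(codes)
approx = mixed_dot(packed, scales, activations)
exact = sum(w * x for w, x in zip(weights, activations))
```

In this toy setting the packed weights occupy half a byte each plus one scale per group, and the result differs from the full-precision dot product only by rounding error, which is bounded by half a quantization step per weight.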
Taxonomy
Topics: Machine Learning in Materials Science · Parallel Computing and Optimization Techniques · Natural Language Processing Techniques
