LMDeploy Accelerates Mixed-Precision LLM Inference with TurboMind

Li Zhang; Youhe Jiang; Guoliang He; Xin Chen; Han Lv; Qian Yao; Ningsheng Ma; Fangcheng Fu; Kai Chen

arXiv:2508.15601·cs.DC·May 18, 2026

TL;DR

TurboMind, the mixed-precision inference engine in LMDeploy, uses hardware-aware GEMM and attention pipelines to cut latency by up to 61% and raise throughput by up to 156% across 16 LLMs and 4 GPU architectures.

Contribution

Introduces TurboMind, a hardware-aware, generalizable mixed-precision inference engine that optimizes matrix and attention operations for diverse hardware and precision formats.

Findings

01. Achieves up to 61% lower latency and 156% higher throughput in mixed-precision LLM inference.

02. Demonstrates consistent performance improvements across 16 LLMs and 4 GPU architectures.

03. Open-sourced at https://github.com/InternLM/lmdeploy.
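
The two headline percentages in the first finding are easier to compare as speedup ratios. The sketch below is plain arithmetic, not a result from the paper; the function names are hypothetical, and because both figures are "up to" maxima they may come from different models or GPUs.

```python
def speedup_from_latency_reduction(reduction: float) -> float:
    """A latency cut of `reduction` (0.61 = 61% lower) means each request
    takes (1 - reduction) of the baseline time."""
    return 1.0 / (1.0 - reduction)


def speedup_from_throughput_gain(gain: float) -> float:
    """A throughput gain of `gain` (1.56 = 156% higher) means (1 + gain)
    times the baseline rate."""
    return 1.0 + gain


latency_speedup = speedup_from_latency_reduction(0.61)
throughput_speedup = speedup_from_throughput_gain(1.56)

print(f"latency:    {latency_speedup:.2f}x")     # latency:    2.56x
print(f"throughput: {throughput_speedup:.2f}x")  # throughput: 2.56x
```

Read this way, both figures correspond to roughly a 2.56x best-case improvement over the baseline.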

Abstract

Mixed-precision inference techniques reduce the memory and computational demands of Large Language Models (LLMs) by applying hybrid precision formats to model weights, activations, and KV caches. However, existing systems struggle to (i) automatically generalize across diverse hardware architectures and precision formats, often requiring fragmented, hand-tuned kernels, and (ii) fully exploit available memory and compute resources, often causing performance bottlenecks. To address these problems, we propose TurboMind, a generalizable and efficient mixed-precision LLM inference engine of LMDeploy. TurboMind is built around two hardware-aware mixed-precision pipelines: A General Matrix Multiply (GEMM) pipeline that optimizes matrix operations through offline weight packing and online acceleration, and an attention pipeline that enables efficient attention computation with different Query,…
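
As a rough illustration of the "offline weight packing, online acceleration" split the abstract describes for the GEMM pipeline, here is a minimal pure-Python sketch. It assumes 4-bit asymmetric per-row quantization with two values packed per byte. All function names are hypothetical, and this is not TurboMind's actual layout or kernel, which the abstract describes as hardware-aware and which this sketch does not attempt to model.

```python
from typing import List, Tuple


def quantize_row(row: List[float], bits: int = 4) -> Tuple[List[int], float, float]:
    """Asymmetric per-row quantization to unsigned `bits`-bit integers."""
    qmax = (1 << bits) - 1
    lo, hi = min(row), max(row)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    q = [min(qmax, max(0, round((w - lo) / scale))) for w in row]
    return q, scale, lo


def pack_int4(q: List[int]) -> bytes:
    """Offline step: pack two 4-bit values per byte, low nibble first."""
    if len(q) % 2:
        q = q + [0]  # pad to an even count
    return bytes(q[i] | (q[i + 1] << 4) for i in range(0, len(q), 2))


def unpack_int4(packed: bytes, n: int) -> List[int]:
    """Recover the first `n` 4-bit values from the packed bytes."""
    out: List[int] = []
    for b in packed:
        out.append(b & 0xF)
        out.append(b >> 4)
    return out[:n]


def matvec_w4(packed_rows, x: List[float]) -> List[float]:
    """Online step: dequantize each row on the fly, accumulate in float."""
    y = []
    for packed, scale, zero in packed_rows:
        q = unpack_int4(packed, len(x))
        y.append(sum((qi * scale + zero) * xi for qi, xi in zip(q, x)))
    return y


# Toy weights and activations (made-up values for illustration only).
weights = [[0.10, -0.20, 0.30, 0.05], [1.0, 0.5, -0.5, -1.0]]
x = [1.0, 2.0, 3.0, 4.0]

# Offline: quantize and pack once, ahead of inference.
packed_rows = []
for row in weights:
    q, scale, zero = quantize_row(row)
    packed_rows.append((pack_int4(q), scale, zero))

# Online: multiply using the packed weights; compare with full precision.
approx = matvec_w4(packed_rows, x)
exact = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
print(approx, exact)
```

The packed form stores each weight in half a byte plus a per-row scale and zero point, which is where the memory saving comes from; the price is a bounded rounding error of at most half a quantization step per weight.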

Code & Models

Repositories

InternLM/lmdeploy
github

Taxonomy

Topics: Machine Learning in Materials Science · Parallel Computing and Optimization Techniques · Natural Language Processing Techniques