DeInfer: Efficient Parallel Inferencing for Decomposed Large Language Models
You-Liang Huang, Xinhao Huang, Chengxi Liao, Zeyi Wen

TL;DR
DeInfer is a specialized system designed to improve the speed and efficiency of parallel inference in decomposed large language models, addressing a key scalability challenge.
Contribution
The paper introduces DeInfer, a high-performance inference system whose optimizations are tailored to parallel inference of decomposed LLMs and remain compatible with state-of-the-art optimization techniques.
Findings
The authors report that DeInfer outperforms existing approaches in their experiments.
Experiments show improved parallel inference speed for decomposed LLMs.
DeInfer facilitates scalable deployment of large language models.
Abstract
Existing work on large language model (LLM) decomposition mainly focuses on improving performance on downstream tasks but ignores the poor parallel inference performance that arises when scaling up the model size. To mitigate this performance issue, this paper introduces DeInfer, a high-performance inference system dedicated to parallel inference of decomposed LLMs. It combines multiple optimizations that maximize performance while remaining compatible with state-of-the-art optimization techniques. Extensive experiments evaluate DeInfer's performance; the results demonstrate its superiority and suggest that it can greatly facilitate the parallel inference of decomposed LLMs.
