DeInfer: Efficient Parallel Inferencing for Decomposed Large Language Models

You-Liang Huang; Xinhao Huang; Chengxi Liao; Zeyi Wen

arXiv:2604.17709 · cs.CL · April 21, 2026

TL;DR

DeInfer is a specialized system designed to improve the speed and efficiency of parallel inference in decomposed large language models, addressing a key scalability challenge.

Contribution

It introduces a high-performance inference system with multiple optimizations tailored for decomposed LLMs, enhancing scalability and compatibility with existing techniques.

Findings

01. DeInfer significantly outperforms existing inference methods.

02. Experiments show improved parallel inference speed for decomposed LLMs.

03. DeInfer facilitates scalable deployment of large language models.

Abstract

Existing work on large language model (LLM) decomposition focuses mainly on improving performance on downstream tasks and overlooks the poor parallel inference performance that appears when the model size is scaled up. To mitigate this issue, this paper introduces DeInfer, a high-performance inference system dedicated to parallel inference of decomposed LLMs. It combines multiple optimizations to maximize performance while remaining compatible with state-of-the-art optimization techniques. Extensive experiments evaluate DeInfer's performance; the results demonstrate its superiority and suggest that it can greatly facilitate the parallel inference of decomposed LLMs.
