LiveMind: Low-latency Large Language Models with Simultaneous Inference

Chuangtao Chen; Grace Li Zhang; Xunzhao Yin; Cheng Zhuo; Ulf Schlichtmann; Bing Li

arXiv:2406.14319 · cs.AI · November 7, 2024 · 1 citation

TL;DR

LiveMind is a low-latency inference framework that lets a large language model start inference while the user's input is still arriving. It reports average response-latency reductions of 84.0% on MMLU and 71.6% on MMLU-Pro at comparable accuracy, and it also supports collaborative inference across different models.

Contribution

The paper presents a framework that moves computation into the input phase, so the model can infer from incomplete input (or wait for more) and respond with lower latency once the input is complete.

Findings

01. 84.0% latency reduction on MMLU dataset
02. 71.6% latency reduction on MMLU-Pro dataset
03. 37% latency reduction using collaborative inference

Abstract

In this paper, we introduce LiveMind, a novel low-latency inference framework for large language models (LLMs) which enables LLMs to perform inference with incomplete user input. By reallocating computational processes to the input phase, a substantial reduction in latency is achieved, thereby significantly enhancing the interactive experience for users of LLMs. The framework adeptly manages the visibility of the streaming input to the model, allowing it to infer from incomplete user input or await additional content. Compared with traditional inference methods on complete user input, our approach demonstrates an average reduction in response latency of 84.0% on the MMLU dataset and 71.6% on the MMLU-Pro dataset, while maintaining comparable accuracy. Additionally, our framework facilitates collaborative inference and output across different models. By employing a large LLM…
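
The idea in the abstract, inferring from partial input or waiting and then answering once the input is complete, can be illustrated with a small Python sketch. This is not the authors' implementation: the class `SimultaneousSession`, its methods, the prompt wording, the `WAIT` token, and the stand-in `toy_model` are all hypothetical names chosen for illustration, and the excerpt above does not specify how LiveMind segments input or phrases its prompts.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical interface: a "model" is any callable mapping a prompt to text.
Model = Callable[[str], str]


@dataclass
class SimultaneousSession:
    """Illustrative only: does work during the input phase, not after it."""
    model: Model
    seen: List[str] = field(default_factory=list)   # input segments so far
    notes: List[str] = field(default_factory=list)  # intermediate inferences

    def on_segment(self, segment: str) -> str:
        """Called whenever a new piece of the user's input arrives."""
        self.seen.append(segment)
        prompt = (
            "Partial input: " + " ".join(self.seen) + "\n"
            "Notes so far: " + " | ".join(self.notes) + "\n"
            "Reply with an inference, or WAIT if more input is needed."
        )
        reply = self.model(prompt).strip()
        if reply != "WAIT":          # the model chose to infer, keep the result
            self.notes.append(reply)
        return reply

    def on_complete(self) -> str:
        """Called once the input is finished; reuses the earlier notes."""
        prompt = (
            "Complete input: " + " ".join(self.seen) + "\n"
            "Notes so far: " + " | ".join(self.notes) + "\n"
            "Give the final answer."
        )
        return self.model(prompt)


def toy_model(prompt: str) -> str:
    """Assumed stand-in for an LLM so the sketch runs without any API."""
    first_line = prompt.splitlines()[0]
    if first_line.startswith("Complete input:"):
        return "final answer"
    words = first_line[len("Partial input: "):].split()
    return "WAIT" if len(words) < 4 else f"noted {len(words)} words"


session = SimultaneousSession(toy_model)
segments = ["A train leaves", "at noon going north.", "When does it arrive?"]
replies = [session.on_segment(s) for s in segments]
print(replies)                 # ['WAIT', 'noted 7 words', 'noted 11 words']
print(session.on_complete())   # final answer
```

In this sketch the model is consulted after every segment, so most of the reasoning can happen while the user is still typing and the final call only needs to combine the stored notes. A real system would replace `toy_model` with an actual LLM call and would need its own policy for when a segment is worth sending to the model.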

Code & Models

Repositories

chuangtaochen-tum/livemind (official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis