CoreInfer: Accelerating Large Language Model Inference with Semantics-Inspired Adaptive Sparse Activation

Qinsi Wang; Saeed Vahidian; Hancheng Ye; Jianyang Gu; Jianyi Zhang; Yiran Chen

arXiv:2410.18311·cs.LG·October 25, 2024·2 cites

Open Access

TL;DR

CoreInfer introduces a sentence-level, semantics-inspired adaptive sparse activation method for large language models, significantly accelerating inference by predicting core neurons without additional MLP computations.

Contribution

It proposes a novel MLP-free, sentence-wise core neuron prediction approach based on semantic stability, enabling zero-cost sparse inference for LLMs.

Findings

01. Achieved a 10.33x speedup over the Hugging Face implementation.

02. Demonstrated effectiveness across various models and tasks.

03. Core neurons show stability and semantic similarity, validating the approach.

Abstract

Large language models (LLMs) with billions of parameters have sparked a new wave of exciting AI applications. However, their high computational costs and memory demands during inference pose significant challenges. Adaptive sparse activation inference, which activates only a small number of neurons for each token, offers a novel way to accelerate model inference without degrading performance, showing great potential for resource-constrained hardware devices. Nevertheless, existing methods predict activated neurons for individual tokens using an additional MLP, which causes frequent changes in activation maps and resource calls and limits the acceleration benefits of sparse activation. In this paper, we introduce CoreInfer, an MLP-free adaptive sparse activation inference method based on sentence-level prediction. Specifically, we propose the concept of sentence-wise core neurons, which…
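To make the sentence-level idea in the abstract concrete, here is a minimal sketch in plain Python of a feed-forward block that picks one fixed neuron subset from the prompt and reuses it for every later token. The abstract is cut off before it defines core neurons, so the selection rule used here (keep the neurons with the largest summed ReLU activation over the prompt) is an illustrative assumption and not the paper's criterion. All names (`select_core_neurons`, `ffn_sparse`, `keep_ratio`) and the toy dimensions are hypothetical.

```python
import random


def relu(v):
    return v if v > 0.0 else 0.0


def hidden_activations(x, w_up):
    # One ReLU activation per hidden neuron; each row of w_up is one neuron.
    return [relu(sum(w * xi for w, xi in zip(row, x))) for row in w_up]


def ffn_dense(x, w_up, w_down):
    # Reference feed-forward block that evaluates every hidden neuron.
    h = hidden_activations(x, w_up)
    d_out = len(w_down[0])
    return [sum(h[n] * w_down[n][j] for n in range(len(h))) for j in range(d_out)]


def select_core_neurons(prompt_tokens, w_up, keep_ratio):
    # ASSUMPTION: score each neuron by its summed activation over the prompt
    # and keep the top fraction. The paper's actual rule is not in the excerpt.
    scores = [0.0] * len(w_up)
    for x in prompt_tokens:
        for n, a in enumerate(hidden_activations(x, w_up)):
            scores[n] += a
    k = max(1, int(len(w_up) * keep_ratio))
    ranked = sorted(range(len(w_up)), key=lambda n: scores[n], reverse=True)
    return sorted(ranked[:k])


def ffn_sparse(x, w_up, w_down, core):
    # Evaluate only the selected neurons; all other rows are never touched.
    out = [0.0] * len(w_down[0])
    for n in core:
        a = relu(sum(w * xi for w, xi in zip(w_up[n], x)))
        for j, w_out in enumerate(w_down[n]):
            out[j] += a * w_out
    return out


# Toy usage with random weights (hypothetical sizes).
rng = random.Random(0)
d_model, d_hidden = 8, 16
w_up = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_hidden)]
w_down = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_hidden)]
prompt = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(5)]

# The subset is chosen once per sentence, with no per-token predictor.
core = select_core_neurons(prompt, w_up, keep_ratio=0.25)

new_token = [rng.uniform(-1, 1) for _ in range(d_model)]
approx = ffn_sparse(new_token, w_up, w_down, core)
exact = ffn_dense(new_token, w_up, w_down)
```

Because `core` is computed once and then held fixed, the per-token cost in this sketch scales with the number of kept neurons rather than with `d_hidden`, and no auxiliary predictor runs during decoding. How closely `approx` tracks `exact` depends on the weights and on `keep_ratio`; the sketch makes no claim about accuracy.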

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis