AdaFuse: Accelerating Dynamic Adapter Inference via Token-Level Pre-Gating and Fused Kernel Optimization
Qiyang Li, Rui Kong, Yuchen Li, Hengyi Cai, Shuaiqiang Wang, Linghe Kong, Guihai Chen, Dawei Yin

TL;DR
AdaFuse significantly reduces inference latency in dynamic adapter-enhanced LLMs by employing token-level pre-gating and fused kernel optimization, enabling faster decoding without sacrificing accuracy.
Contribution
AdaFuse introduces a novel token-level pre-gating strategy and fused kernel execution, co-designed with hardware, to optimize dynamic adapter inference in large language models.
Findings
Reduces decoding latency by over 2.4x for LLMs with dynamic adapters.
Maintains state-of-the-art accuracy with dynamic adapters.
Demonstrates effective hardware-software co-design for efficiency.
Abstract
The integration of dynamic, sparse structures like Mixture-of-Experts (MoE) with parameter-efficient adapters (e.g., LoRA) is a powerful technique for enhancing Large Language Models (LLMs). However, this architectural enhancement comes at a steep cost: despite minimal increases in computational load, the inference latency often skyrockets, leading to decoding speeds slowing by over 2.5 times. Through a fine-grained performance analysis, we pinpoint the primary bottleneck not in the computation itself, but in the severe overhead from fragmented, sequential CUDA kernel launches required for conventional dynamic routing. To address this challenge, we introduce AdaFuse, a framework built on a tight co-design between the algorithm and the underlying hardware system to enable efficient dynamic adapter execution. Departing from conventional layer-wise or block-wise routing, AdaFuse employs a…
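The launch-overhead argument in the abstract can be made concrete with a toy count of kernel launches per decoded token. This is a rough sketch, not the authors' implementation. The names `AdapterConfig`, `launches_layerwise`, and `launches_pregated`, the default layer count, top-k value, and kernels-per-adapter figure are all hypothetical. It also assumes, for illustration only, that token-level pre-gating means one routing decision per token followed by a single fused launch that applies the selected adapters.

```python
from dataclasses import dataclass


@dataclass
class AdapterConfig:
    # All defaults are illustrative assumptions, not values from the paper.
    num_layers: int = 32          # adapter-equipped layers in the model
    top_k: int = 2                # adapters selected per routing decision
    kernels_per_adapter: int = 2  # e.g. LoRA down-projection and up-projection
    router_kernels: int = 1       # launches needed for one routing decision


def launches_layerwise(cfg: AdapterConfig) -> int:
    """Launches per token when every layer routes and applies adapters itself."""
    per_layer = cfg.router_kernels + cfg.top_k * cfg.kernels_per_adapter
    return cfg.num_layers * per_layer


def launches_pregated(cfg: AdapterConfig, fused_kernels: int = 1) -> int:
    """Launches per token with one up-front routing decision and a fused apply step."""
    return cfg.router_kernels + fused_kernels


cfg = AdapterConfig()
print(launches_layerwise(cfg), launches_pregated(cfg))  # prints: 160 2
```

The arithmetic work is similar in both cases. What changes is how many small, sequential launches the GPU has to schedule for each token, which is the overhead the abstract identifies as the bottleneck.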
Taxonomy
Topics: Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis · Natural Language Processing Techniques
