AdaFuse: Accelerating Dynamic Adapter Inference via Token-Level Pre-Gating and Fused Kernel Optimization

Qiyang Li; Rui Kong; Yuchen Li; Hengyi Cai; Shuaiqiang Wang; Linghe Kong; Guihai Chen; Dawei Yin

arXiv:2603.11873·cs.AI·March 13, 2026


TL;DR

AdaFuse reduces decoding latency in dynamic adapter-enhanced LLMs by over 2.4x through token-level pre-gating and fused kernel optimization, without sacrificing accuracy.

Contribution

AdaFuse introduces a novel token-level pre-gating strategy and fused kernel execution, co-designed with hardware, to optimize dynamic adapter inference in large language models.

Findings

01 Reduces decoding latency by over 2.4x in LLMs.

02 Maintains state-of-the-art accuracy with dynamic adapters.

03 Demonstrates effective hardware-software co-design for efficiency.

Abstract

The integration of dynamic, sparse structures like Mixture-of-Experts (MoE) with parameter-efficient adapters (e.g., LoRA) is a powerful technique for enhancing Large Language Models (LLMs). However, this architectural enhancement comes at a steep cost: despite minimal increases in computational load, the inference latency often skyrockets, leading to decoding speeds slowing by over 2.5 times. Through a fine-grained performance analysis, we pinpoint the primary bottleneck not in the computation itself, but in the severe overhead from fragmented, sequential CUDA kernel launches required for conventional dynamic routing. To address this challenge, we introduce AdaFuse, a framework built on a tight co-design between the algorithm and the underlying hardware system to enable efficient dynamic adapter execution. Departing from conventional layer-wise or block-wise routing, AdaFuse employs a…
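
The launch-overhead bottleneck described above can be illustrated with a toy model. The sketch below is not the authors' kernel or routing scheme; it is a plain-Python illustration, assuming each matrix-vector product stands in for one CUDA kernel launch. All names (`matvec`, `sequential_adapters`, `fused_adapters`, `launch_log`) are hypothetical. It shows that applying k gated LoRA adapters one by one costs 2k launches, while stacking the same adapters gives an identical result in 2 launches.

```python
# Toy model: each matvec call stands in for one GPU kernel launch.
# Every name here is a hypothetical illustration, not part of AdaFuse.

launch_log = []  # one entry per simulated kernel launch


def matvec(M, x):
    """Matrix-vector product; records one simulated kernel launch."""
    launch_log.append(1)
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def sequential_adapters(x, adapters, gates):
    """Apply each adapter (A_i, B_i) separately: two launches per adapter."""
    d = len(adapters[0][1])  # output dimension (rows of B_i)
    out = [0.0] * d
    for (A, B), g in zip(adapters, gates):
        h = matvec(A, x)  # down-projection: r values
        y = matvec(B, h)  # up-projection: d values
        out = [o + g * v for o, v in zip(out, y)]
    return out


def fused_adapters(x, adapters, gates):
    """Stack the adapters so the same gated sum needs only two launches."""
    d = len(adapters[0][1])
    # A_cat stacks the rows of every A_i: shape (k*r, d_in).
    A_cat = [row for A, _ in adapters for row in A]
    # B_cat concatenates the gate-scaled columns of every B_i: shape (d, k*r).
    B_cat = [
        [g * v for (_, B), g in zip(adapters, gates) for v in B[i]]
        for i in range(d)
    ]
    return matvec(B_cat, matvec(A_cat, x))


# Two rank-1 adapters on a 2-dimensional input.
adapters = [
    ([[1.0, 2.0]], [[1.0], [0.0]]),  # (A_1, B_1)
    ([[0.0, 1.0]], [[2.0], [1.0]]),  # (A_2, B_2)
]
gates = [0.5, 0.25]
x = [1.0, 1.0]

launch_log.clear()
seq_out = sequential_adapters(x, adapters, gates)
seq_launches = len(launch_log)

launch_log.clear()
fused_out = fused_adapters(x, adapters, gates)
fused_launches = len(launch_log)

print(seq_out, seq_launches)      # [2.0, 0.25] 4
print(fused_out, fused_launches)  # [2.0, 0.25] 2
```

The arithmetic is the same in both paths; only the number of launches differs. On a real GPU the stacking step would itself need to be precomputed or done inside the kernel, which this toy model does not account for.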

Taxonomy

Topics: Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis · Natural Language Processing Techniques