Sub-Token Routing in LoRA for Adaptation and Query-Aware KV Compression

Wei Jiang; Wei Wang

arXiv:2604.21335·cs.LG·May 7, 2026

TL;DR

This paper introduces sub-token routing in LoRA-adapted transformers, which compresses key-value (KV) representations at a finer granularity than tokens, pages, heads, or layers while largely preserving task accuracy.

Contribution

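The paper proposes sub-token routing for LoRA-adapted transformers in two settings: a query-independent one that combines routed subspace LoRA with value-group routing on the KV path, and a query-aware one that uses a predictor-based selector to allocate a global retention budget.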

Findings

01. Query-independent subspace LoRA improves language-model quality with reduced KV budgets.

02. Query-aware sub-token routing preserves downstream performance under compression.

03. Combining token-level and sub-token routing enables deeper KV compression with minimal accuracy loss.

Abstract

Sub-token routing provides a finer compression axis for transformer efficiency than the coarse units used in most prior work, such as tokens, pages, heads, or layers. In this paper, we study routing within a token representation itself in LoRA-adapted transformers. We consider two settings. In the query-independent setting, we combine routed subspace LoRA with value-group routing on the KV path for compression-aware language modeling. In the query-aware setting, we use a predictor-based selector to allocate a global retention budget over context-token/value-group pairs using query-conditioned relevance. Experiments show that the query-independent design improves language-model quality under reduced KV budgets, while the query-aware design preserves downstream behavior well under KV compression. We further show that sub-token routing is most effective as a complementary compression axis…
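To make the query-aware setting in the abstract more concrete, the sketch below shows one way a single global retention budget could be spread over context-token/value-group pairs. It is an illustration under stated assumptions, not the authors' implementation: the function name `select_global_budget`, the `keep_fraction` parameter, and the plain top-k rule are hypothetical, and the relevance scores are passed in as inputs rather than produced by a learned predictor.

```python
from typing import List


def select_global_budget(scores: List[List[float]],
                         keep_fraction: float) -> List[List[bool]]:
    """Return a keep/drop mask over (context token, value group) pairs.

    scores[t][g] is an assumed query-conditioned relevance score for value
    group g of context token t. The budget is global: pairs compete across
    all tokens, so one token may keep every group while another keeps none.
    """
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in [0, 1]")

    # Flatten to (score, token index, group index) triples.
    pairs = [(s, t, g)
             for t, row in enumerate(scores)
             for g, s in enumerate(row)]

    # Number of pairs to retain, rounded down.
    budget = int(keep_fraction * len(pairs))

    # Highest score first; ties broken by index so the result is deterministic.
    pairs.sort(key=lambda p: (-p[0], p[1], p[2]))

    mask = [[False] * len(row) for row in scores]
    for _, t, g in pairs[:budget]:
        mask[t][g] = True
    return mask


# Three context tokens, two value groups each, half of all pairs retained.
example_scores = [[0.9, 0.1],
                  [0.4, 0.8],
                  [0.2, 0.3]]
example_mask = select_global_budget(example_scores, keep_fraction=0.5)
```

Because the budget is shared rather than fixed per token, the second token in this example keeps both value groups while the third keeps none, which is the kind of sub-token granularity that token-level eviction alone cannot express.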
