Sub-Token Routing in LoRA for Adaptation and Query-Aware KV Compression
Wei Jiang, Wei Wang

TL;DR
This paper introduces sub-token routing within LoRA-adapted transformers, which compresses key-value (KV) representations at a granularity finer than whole tokens while largely preserving task accuracy.
Contribution
It proposes sub-token routing in LoRA-adapted transformers and studies it in two settings, query-independent and query-aware, for KV compression and model efficiency.
Findings
Query-independent routed subspace LoRA, combined with value-group routing, improves language-model quality under reduced KV budgets.
Query-aware sub-token routing preserves downstream performance under compression.
Combining token-level and sub-token routing enables deeper KV compression with minimal accuracy loss.
Abstract
Sub-token routing provides a finer compression axis for transformer efficiency than the coarse units used in most prior work, such as tokens, pages, heads, or layers. In this paper, we study routing within a token representation itself in LoRA-adapted transformers. We consider two settings. In the query-independent setting, we combine routed subspace LoRA with value-group routing on the KV path for compression-aware language modeling. In the query-aware setting, we use a predictor-based selector to allocate a global retention budget over context-token/value-group pairs using query-conditioned relevance. Experiments show that the query-independent design improves language-model quality under reduced KV budgets, while the query-aware design preserves downstream behavior well under KV compression. We further show that sub-token routing is most effective as a complementary compression axis…
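To make the query-aware setting concrete, the following is a minimal sketch of allocating a global retention budget over context-token/value-group pairs. It assumes the selector reduces to a global top-k over relevance scores; the abstract does not specify the selection rule, so that choice is an assumption, and the predictor that produces the scores is not shown. The names `select_pairs`, `apply_mask`, `scores`, and `budget` are hypothetical and do not come from the paper.

```python
from typing import List


def select_pairs(scores: List[List[float]], budget: int) -> List[List[bool]]:
    """Keep the `budget` highest-scoring (token, value-group) pairs.

    scores[t][g] is an assumed query-conditioned relevance for value group g
    of context token t. The budget is global: it is shared across all tokens
    rather than fixed per token.
    """
    n_tokens = len(scores)
    n_groups = len(scores[0]) if scores else 0
    flat = [(scores[t][g], t, g) for t in range(n_tokens) for g in range(n_groups)]
    budget = max(0, min(budget, len(flat)))
    # Highest score first; ties fall back to (token, group) order for determinism.
    flat.sort(key=lambda item: (-item[0], item[1], item[2]))
    mask = [[False] * n_groups for _ in range(n_tokens)]
    for _, t, g in flat[:budget]:
        mask[t][g] = True
    return mask


def apply_mask(values: List[List[List[float]]],
               mask: List[List[bool]]) -> List[List[List[float]]]:
    """Zero out dropped value groups; values[t][g] is one group's slice of a value vector."""
    return [
        [list(group) if keep else [0.0] * len(group)
         for group, keep in zip(token_groups, token_mask)]
        for token_groups, token_mask in zip(values, mask)
    ]


# Three context tokens, two value groups each, and a budget of three pairs.
scores = [[0.9, 0.1], [0.2, 0.8], [0.05, 0.3]]
mask = select_pairs(scores, budget=3)
values = [[[1.0, 2.0], [3.0, 4.0]] for _ in scores]
compressed = apply_mask(values, mask)
```

In this toy run every token keeps exactly one of its two value groups, but nothing forces that: with a global budget, a highly relevant token can keep all of its groups while an irrelevant one keeps none. A real implementation would drop the unselected groups from storage instead of zeroing them; zeroing is used here only to keep the sketch short.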
