HubRouter: A Pluggable Sub-Quadratic Routing Primitive for Hybrid Sequence Models
Abhinaba Basu

TL;DR
HubRouter is a pluggable routing module that replaces O(n^2) attention with O(nM) hub-mediated routing over a small set of M learned hub tokens, raising training throughput while keeping perplexity competitive in sequence models trained from scratch.
Contribution
It presents a hub-mediated routing mechanism that reduces attention cost from O(n^2) to O(nM), with M << n learned hub tokens, and reports higher training throughput together with a nominal, single-seed perplexity improvement.
Findings
HubRouter achieves up to ~90x training throughput at sequence length 1024 against matched PyTorch-native baselines; an optimised baseline would narrow this to ~10-15x.
Replacing 25% of attention layers with HubRouter improves perplexity.
Optimal hub count (8-14) is identified for stable convergence.
Abstract
We introduce HubRouter, a pluggable module that replaces O(n^2) attention layers with O(nM) hub-mediated routing, where M << n is a small number of learned hub tokens. We demonstrate it in two from-scratch architectures: a Jamba-style hybrid and a 12-layer Transformer; retrofit into pretrained models is a tested negative case. HubRouter implements an encode-decode-score-council pipeline: M learned hubs cross-attend to all tokens, tokens project against hubs for routing fingerprints, a score head selects top-k tokens, and a sparse council attends only to the selected subset. We validate HubRouter in three settings. (1) Hub-Jamba yields a nominal 4.2% PPL improvement (200.2 vs 209.0, single seed; possibly within seed noise) and up to ~90x training throughput at sequence length 1024 in matched PyTorch-native baselines; an optimised baseline would narrow this to ~10-15x. (2) Graduated…
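The encode-decode-score-council pipeline described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the function name `hub_router`, the linear score head `w_score`, the softmax normalisation, the lack of learned projections, and the way the council output is added back to the selected tokens are all assumptions made for the sake of a runnable example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def hub_router(tokens, hubs, w_score, k):
    """Hypothetical sketch. tokens: (n, d), hubs: (M, d), w_score: (M,), k: int."""
    n, d = tokens.shape
    scale = 1.0 / np.sqrt(d)

    # Encode: M hubs cross-attend to all n tokens, costing O(nM) scores.
    enc = softmax(hubs @ tokens.T * scale, axis=-1)       # (M, n)
    hub_states = enc @ tokens                             # (M, d)

    # Decode: each token projects against the hubs to get a routing fingerprint.
    fingerprints = softmax(tokens @ hub_states.T * scale, axis=-1)  # (n, M)

    # Score: an assumed linear head over fingerprints picks the top-k tokens.
    scores = fingerprints @ w_score                       # (n,)
    top_idx = np.sort(np.argsort(-scores)[:k])            # keep sequence order

    # Council: dense attention restricted to the k selected tokens, O(k^2).
    sel = tokens[top_idx]
    council = softmax(sel @ sel.T * scale, axis=-1) @ sel  # (k, d)

    # Assumed combination: hub-mediated update for every token,
    # plus the council output for the selected subset.
    out = fingerprints @ hub_states                       # (n, d)
    out[top_idx] += council
    return out, top_idx

rng = np.random.default_rng(0)
n, d, M, k = 1024, 16, 8, 32
tokens = rng.standard_normal((n, d))
hubs = rng.standard_normal((M, d))
w_score = rng.standard_normal(M)
out, top_idx = hub_router(tokens, hubs, w_score, k)
```

With these illustrative sizes, the encode and decode steps each compute n x M = 8,192 scores and the council computes k x k = 1,024, compared with n x n = 1,048,576 for full attention over the same sequence.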
