TL;DR
COREY is an entropy-guided runtime scheduler for selective-scan kernels in Mamba state space models. It lowers kernel latency relative to an unoptimized baseline on GPUs, but it does not improve end-to-end speed over the best static chunk size.
Contribution
The paper presents an entropy-based scheduling method that matches a one-time static oracle at the kernel level, and it evaluates the practical effect of routing that choice into end-to-end GPU workloads.
Findings
Achieves 4.41× lower latency than an unoptimized baseline on a consumer GPU, and 3.90× to 4.04× lower latency on a data-center accelerator.
A calibrated entropy rule, H_ref = log K, recovers the locally optimal chunk size and matches a one-time static oracle.
Entropy-guided scheduling incurs overhead, which can be mitigated with fallback strategies.
Abstract
Mamba selective state space models (SSMs) provide linear-time sequence modeling but remain sensitive to selective-scan chunk scheduling. We present COREY, a concept-and-feasibility runtime scheduler that maps fixed-bin activation entropy to chunk size. We evaluate COREY in three tiers: a prototype cost model, real-checkpoint kernel timing, and routed end-to-end ablations on modern GPUs. At the kernel level, a calibrated rule, \(H_{\mathrm{ref}}=\log K\), recovers the locally optimal chunk and matches a one-time static oracle, yielding \(4.41\times\) lower latency than an unoptimized baseline on a consumer GPU and \(3.90\times\) to \(4.04\times\) lower latency on a data-center accelerator. Routing this choice into a patched live scan kernel closes the engineering loop without improving end-to-end speed: in unified routed ablations, the best static chunk outperforms all…
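The idea of mapping fixed-bin activation entropy to a chunk size can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the mapping rule, so the sketch assumes that K is the number of histogram bins, that entropy is normalized by H_ref = log K, and that higher normalized entropy selects a larger chunk from a sorted candidate list. The function names `fixed_bin_entropy` and `pick_chunk_size`, the bin range, and the candidate chunk sizes are all hypothetical.

```python
import math


def fixed_bin_entropy(values, num_bins, lo, hi):
    """Shannon entropy (in nats) of an equal-width histogram over [lo, hi]."""
    if not values:
        raise ValueError("values must be non-empty")
    if num_bins < 2 or hi <= lo:
        raise ValueError("need num_bins >= 2 and hi > lo")
    counts = [0] * num_bins
    width = (hi - lo) / num_bins
    for v in values:
        # Out-of-range values are clamped into the first or last bin.
        idx = int((v - lo) / width)
        idx = min(max(idx, 0), num_bins - 1)
        counts[idx] += 1
    total = len(values)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def pick_chunk_size(entropy, num_bins, candidates):
    """Pick a chunk size from sorted candidates using entropy / log(num_bins).

    Assumption (not from the paper): the rule is monotone, so a normalized
    entropy of 0 selects the smallest candidate and 1 selects the largest.
    """
    h_ref = math.log(num_bins)  # maximum entropy of a num_bins histogram
    ratio = min(max(entropy / h_ref, 0.0), 1.0)
    idx = round(ratio * (len(candidates) - 1))
    return candidates[idx]


if __name__ == "__main__":
    activations = [0.1, 0.3, 0.6, 0.9]  # hypothetical activation samples
    h = fixed_bin_entropy(activations, num_bins=4, lo=0.0, hi=1.0)
    print(pick_chunk_size(h, num_bins=4, candidates=[64, 128, 256, 512]))
```

In this sketch the normalization by log K keeps the ratio in [0, 1] regardless of the bin count, which is one plausible reason to use log K as a reference value; the direction of the mapping and the candidate set would need to be calibrated per kernel and device.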
