S0 Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models
Jack Young

TL;DR
S0 tuning adapts hybrid recurrent-attention models by tuning a single initial state matrix per recurrent layer while all model weights stay frozen. It adds no inference overhead and outperforms LoRA by +10.8 pp on HumanEval.
Contribution
The paper introduces S0 tuning, a novel zero-inference-overhead PEFT method that tunes only the initial state matrices of recurrent layers in hybrid models.
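To make the idea concrete, the following is a minimal sketch of tuning only an initial state, not the paper's implementation. It assumes a toy linear recurrence S_t = decay * S_{t-1} + k_t v_t^T with readout y_t = S_t^T q_t, which is far simpler than a GatedDeltaNet or Mamba-2 layer. The sizes, the decay value, the squared-error loss, the learning rate, and all function names are hypothetical choices made for illustration.

```python
import random

random.seed(0)
# Hypothetical toy sizes and hyperparameters (not taken from the paper).
D_K, D_V, T, DECAY, LR = 3, 2, 6, 0.9, 0.05


def outer(a, b):
    """Outer product of two vectors as a nested list."""
    return [[x * y for y in b] for x in a]


def readout(S, q):
    """Compute y = S^T q for a D_K x D_V state S and a length-D_K query q."""
    return [sum(S[i][j] * q[i] for i in range(len(q))) for j in range(len(S[0]))]


def run_layer(S0, qs, ks, vs):
    """Frozen toy recurrence started from the initial state S0."""
    S = [row[:] for row in S0]
    ys = []
    for q, k, v in zip(qs, ks, vs):
        kv = outer(k, v)
        S = [[DECAY * s + u for s, u in zip(rs, ru)] for rs, ru in zip(S, kv)]
        ys.append(readout(S, q))
    return ys


def loss_and_grad(S0, qs, ks, vs, targets):
    """Squared-error loss and its analytic gradient with respect to S0 only."""
    ys = run_layer(S0, qs, ks, vs)
    grad = [[0.0] * D_V for _ in range(D_K)]
    loss = 0.0
    for t, (q, y, tgt) in enumerate(zip(qs, ys, targets), start=1):
        err = [a - b for a, b in zip(y, tgt)]
        loss += 0.5 * sum(e * e for e in err)
        scale = DECAY ** t  # S0 reaches step t scaled by decay^t
        for i in range(D_K):
            for j in range(D_V):
                grad[i][j] += scale * q[i] * err[j]
    return loss, grad


def rand_vec(n):
    return [random.uniform(-1, 1) for _ in range(n)]


# Frozen inputs stand in for the frozen projections of a real layer.
qs = [rand_vec(D_K) for _ in range(T)]
ks = [rand_vec(D_K) for _ in range(T)]
vs = [rand_vec(D_V) for _ in range(T)]

# Targets come from a hidden "good" initial state that training should approach.
S0_star = [rand_vec(D_V) for _ in range(D_K)]
targets = run_layer(S0_star, qs, ks, vs)

# The only trainable parameter: the initial state, starting from zeros.
S0 = [[0.0] * D_V for _ in range(D_K)]
initial_loss, _ = loss_and_grad(S0, qs, ks, vs, targets)
for _ in range(200):
    _, grad = loss_and_grad(S0, qs, ks, vs, targets)
    S0 = [[s - LR * g for s, g in zip(rs, rg)] for rs, rg in zip(S0, grad)]
final_loss, _ = loss_and_grad(S0, qs, ks, vs, targets)
print(initial_loss, final_loss)
```

In this sketch the tuned S0 simply replaces the all-zero starting state, so running the layer afterwards costs exactly what it did before tuning. That is the sense in which a learned initial state adds no inference overhead, and no weight merging step is involved.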
Findings
S0 tuning outperforms LoRA on HumanEval by +10.8 pp (p < 0.001).
S0 tuning improves Qwen3.5-4B greedy pass@1 by +23.6 +/- 1.7 pp (10 seeds).
State initialization is a strong PEFT surface for hybrid models.
Abstract
Using roughly 48 execution-verified HumanEval training solutions, tuning a single initial state matrix per recurrent layer, with zero inference overhead, outperforms LoRA by +10.8 pp (p < 0.001) on HumanEval. The method, which we call S0 tuning, optimizes one state matrix per recurrent layer while freezing all model weights. On Qwen3.5-4B (GatedDeltaNet hybrid), S0 tuning improves greedy pass@1 by +23.6 +/- 1.7 pp (10 seeds). On FalconH1-7B (Mamba-2 hybrid), S0 reaches 71.8% +/- 1.3 and LoRA reaches 71.4% +/- 2.4 (3 seeds), statistically indistinguishable at this sample size while requiring no weight merging. Cross-domain transfer is significant on MATH-500 (+4.8 pp, p = 0.00002, 8 seeds) and GSM8K (+2.8 pp, p = 0.0003, 10 seeds); a text-to-SQL benchmark (Spider) shows no transfer, consistent with the trajectory-steering mechanism. A prefix-tuning control on a pure Transformer…
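The FalconH1-7B comparison in the abstract (71.8% +/- 1.3 versus 71.4% +/- 2.4, 3 seeds each) can be sanity-checked with a Welch t statistic computed from the summary numbers. This is a rough sketch under a loudly stated assumption: the abstract does not say whether "+/-" denotes a sample standard deviation, a standard error, or an interval, and the code below treats it as a sample standard deviation across seeds. The function name `welch_t` is illustrative.

```python
import math


def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch t statistic and Welch-Satterthwaite degrees of freedom.

    Assumes sd_a and sd_b are sample standard deviations (an assumption here;
    the source does not define its +/- values).
    """
    var_a = sd_a ** 2 / n_a
    var_b = sd_b ** 2 / n_b
    t = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (
        var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1)
    )
    return t, df


# S0 versus LoRA on FalconH1-7B, numbers taken from the abstract.
t_stat, dof = welch_t(71.8, 1.3, 3, 71.4, 2.4, 3)
print(round(t_stat, 3), round(dof, 2))
```

Under that assumption the statistic comes out near 0.25 with about 3 degrees of freedom, far below the two-sided 5% critical value of roughly 3.2, which is consistent with the abstract's description of the two results as statistically indistinguishable at this sample size.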
