S0 Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models

Jack Young

arXiv:2604.01168·cs.CL·April 7, 2026

TL;DR

S0 tuning adapts hybrid recurrent-attention models by tuning a single initial state matrix per recurrent layer while all model weights stay frozen. It adds zero inference overhead and outperforms LoRA by +10.8 pp on HumanEval.

Contribution

The paper introduces S0 tuning, a novel zero-inference-overhead PEFT method that tunes only the initial state matrices of recurrent layers in hybrid models.
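
As a rough illustration of the idea, the sketch below tunes only the initial state of a toy linear recurrence while everything else stays fixed. The recurrence itself, the function names (run_recurrence, loss_and_grad_S0, tune_S0), the scalar decay, and the squared-error loss are all assumptions made for illustration; none of them are taken from the paper or its repository.

```python
import random

def run_recurrence(S0, qs, ks, vs, decay):
    """Toy linear recurrence: S_t = decay * S_{t-1} + k_t v_t^T, y_t = S_t^T q_t."""
    d_k, d_v = len(S0), len(S0[0])
    S = [row[:] for row in S0]  # copy so the caller's S0 is not mutated
    outputs = []
    for q, k, v in zip(qs, ks, vs):
        for i in range(d_k):
            for j in range(d_v):
                S[i][j] = decay * S[i][j] + k[i] * v[j]
        outputs.append([sum(S[i][j] * q[i] for i in range(d_k)) for j in range(d_v)])
    return outputs

def loss_and_grad_S0(S0, qs, ks, vs, targets, decay):
    """Squared-error loss and its analytic gradient with respect to S0 only."""
    d_k, d_v = len(S0), len(S0[0])
    ys = run_recurrence(S0, qs, ks, vs, decay)
    loss = 0.0
    grad = [[0.0] * d_v for _ in range(d_k)]
    for t, (q, y, tgt) in enumerate(zip(qs, ys, targets)):
        err = [y[j] - tgt[j] for j in range(d_v)]
        loss += 0.5 * sum(e * e for e in err)
        scale = decay ** (t + 1)  # S0's share of S_t shrinks geometrically
        for i in range(d_k):
            for j in range(d_v):
                grad[i][j] += scale * q[i] * err[j]
    return loss, grad

def tune_S0(qs, ks, vs, targets, decay, d_k, d_v, lr=0.1, steps=200):
    """Gradient descent on S0; qs, ks, vs, and decay (the 'frozen model') never change."""
    S0 = [[0.0] * d_v for _ in range(d_k)]  # zero start stands in for the untuned model
    for _ in range(steps):
        _, grad = loss_and_grad_S0(S0, qs, ks, vs, targets, decay)
        for i in range(d_k):
            for j in range(d_v):
                S0[i][j] -= lr * grad[i][j]
    return S0

def rand_vec(n):
    return [random.uniform(-1.0, 1.0) for _ in range(n)]

# Hypothetical toy data: qs, ks, vs stand in for projections from frozen weights.
random.seed(0)
d_k, d_v, T, decay = 3, 2, 5, 0.9
qs = [rand_vec(d_k) for _ in range(T)]
ks = [rand_vec(d_k) for _ in range(T)]
vs = [rand_vec(d_v) for _ in range(T)]
hidden_S0 = [rand_vec(d_v) for _ in range(d_k)]  # generates the targets to fit
targets = run_recurrence(hidden_S0, qs, ks, vs, decay)

zero_S0 = [[0.0] * d_v for _ in range(d_k)]
loss_before, _ = loss_and_grad_S0(zero_S0, qs, ks, vs, targets, decay)
tuned_S0 = tune_S0(qs, ks, vs, targets, decay, d_k, d_v)
loss_after, _ = loss_and_grad_S0(tuned_S0, qs, ks, vs, targets, decay)
print(f"loss before: {loss_before:.4f}, after: {loss_after:.4f}")
```

Because the tuned matrix only replaces the starting value of the state, the forward pass in this sketch does the same amount of work before and after tuning, and there is nothing to merge into the weights.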

Findings

01 S0 tuning outperforms LoRA on HumanEval by +10.8 pp (p < 0.001).

02 On Qwen3.5-4B (GatedDeltaNet hybrid), S0 tuning improves greedy pass@1 by +23.6 +/- 1.7 pp (10 seeds).

03 State initialization is a strong PEFT surface for hybrid models.

Abstract

Using roughly 48 execution-verified HumanEval training solutions, tuning a single initial state matrix per recurrent layer, with zero inference overhead, outperforms LoRA by +10.8 pp (p < 0.001) on HumanEval. The method, which we call S0 tuning, optimizes one state matrix per recurrent layer while freezing all model weights. On Qwen3.5-4B (GatedDeltaNet hybrid), S0 tuning improves greedy pass@1 by +23.6 +/- 1.7 pp (10 seeds). On FalconH1-7B (Mamba-2 hybrid), S0 reaches 71.8% +/- 1.3 and LoRA reaches 71.4% +/- 2.4 (3 seeds), statistically indistinguishable at this sample size while requiring no weight merging. Cross-domain transfer is significant on MATH-500 (+4.8 pp, p = 0.00002, 8 seeds) and GSM8K (+2.8 pp, p = 0.0003, 10 seeds); a text-to-SQL benchmark (Spider) shows no transfer, consistent with the trajectory-steering mechanism. A prefix-tuning control on a pure Transformer…
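
To make the "statistically indistinguishable" FalconH1-7B comparison concrete, the following sketch computes Welch's t statistic from the reported summary numbers. It assumes the +/- values are sample standard deviations over the 3 seeds, which the abstract does not state; if they are standard errors or confidence intervals, the numbers would differ. The helper name welch_t is hypothetical.

```python
import math

def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom from summary stats."""
    var_a = sd_a ** 2 / n_a  # squared standard error of mean A
    var_b = sd_b ** 2 / n_b  # squared standard error of mean B
    t = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1))
    return t, df

# Assumption: 1.3 and 2.4 are per-method sample standard deviations over 3 seeds.
t, df = welch_t(71.8, 1.3, 3, 71.4, 2.4, 3)
print(f"t = {t:.2f}, df = {df:.2f}")
```

Under that assumption the statistic comes out near 0.25 with about 3 degrees of freedom, far below the two-sided 5% critical value of roughly 3.2, which is in line with the abstract's reading that the 0.4 pp gap cannot be distinguished from seed noise at this sample size.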

Code & Models

Repositories

jackyoung27/s0-tuning
github

Models

Datasets

JackYoung27/humaneval-s0-train
dataset · 37 dl
