Procedural-skill SFT across capacity tiers: A W-Shaped pre-SFT Trajectory and Regime-Asymmetric Mechanism on 0.8B-4B Qwen3.5 Models
Igor Strozzi

TL;DR
This study measures the impact of supervised fine-tuning on procedural skills across different sizes of Qwen3.5 models, revealing a W-shaped pre-fine-tuning trajectory and regime-asymmetric effects.
Contribution
It uncovers a regime-asymmetric pattern in SFT effectiveness across model sizes and introduces a benchmark artifact and validation methodology.
Findings
SFT contributes a roughly equal lift across model sizes (~+0.07).
Post-SFT scores vary mainly because of a W-shaped pre-SFT base trajectory.
SFT is most effective where the base model struggles, showing regime asymmetry.
Abstract
We measure procedural-skill SFT contribution across three Qwen3.5 dense scales (0.8B, 2B, 4B) on a 200-task / 40-skill holdout, with Claude Haiku 4.5 as a frontier reference. The corpus is 353 rows of (task + procedural-skill block, Opus chain-of-thought, judge-pass) demonstrations. Main finding. Under matched-path LLM-only scoring, the SFT-attributable procedural lift is roughly uniform across the 0.8B, 2B, and 4B sizes. Variation in post-SFT scores is dominated by a W-shaped pre-SFT base trajectory across 0.8B, 2B, 4B, and Haiku 4.5: the 5-step procedure hurts 0.8B and 4B, helps 2B, and helps frontier Haiku modestly. SFT works hardest in absolute terms where the base struggles with the procedure, a regime-asymmetric pattern with a falsifiable prediction at 8B/14B. Methodology. (i) A bench…
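The decomposition described in the abstract can be sketched as simple arithmetic. This is a minimal illustration, not the paper's evaluation code: it assumes the procedure's pre-SFT effect is measured against a task-only baseline and that the SFT-attributable lift is the post-SFT score minus the pre-SFT score under the same prompt path. The `Scores` fields, the function names, and every number below are hypothetical placeholders, not results from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scores:
    """Hypothetical per-model scores on the holdout (placeholders only)."""
    base_plain: float      # base model, task only (assumed baseline)
    base_procedure: float  # base model, task + procedural-skill block (pre-SFT)
    sft_procedure: float   # fine-tuned model, same prompt path (post-SFT)


def procedure_effect(s: Scores) -> float:
    """How much the procedure block alone helps or hurts the base model."""
    return s.base_procedure - s.base_plain


def sft_lift(s: Scores) -> float:
    """SFT-attributable lift under a matched prompt path."""
    return s.sft_procedure - s.base_procedure


# Made-up values chosen only to show the pattern: the lift is the same
# for both models, while the post-SFT score differs because the pre-SFT
# base moves in opposite directions.
toy = {
    "model_a": Scores(base_plain=0.50, base_procedure=0.40, sft_procedure=0.45),
    "model_b": Scores(base_plain=0.50, base_procedure=0.60, sft_procedure=0.65),
}

for name, s in toy.items():
    print(name, round(procedure_effect(s), 2), round(sft_lift(s), 2), s.sft_procedure)
```

In this toy setup both models gain the same amount from SFT, yet their post-SFT scores differ because the procedure hurts one base model and helps the other. That is the sense in which post-SFT variation can be dominated by the pre-SFT trajectory.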
