Procedural-skill SFT across capacity tiers: A W-Shaped pre-SFT Trajectory and Regime-Asymmetric Mechanism on 0.8B-4B Qwen3.5 Models

Igor Strozzi

arXiv:2605.11907·cs.LG·May 15, 2026

TL;DR

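This study measures how much supervised fine-tuning (SFT) contributes to procedural-skill performance across 0.8B, 2B, and 4B Qwen3.5 models. The SFT-attributable lift is roughly uniform across sizes, post-SFT results track a W-shaped pre-SFT base trajectory, and SFT works hardest where the base model struggles with the procedure.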

Contribution

It uncovers a regime-asymmetric pattern in SFT effectiveness across model sizes and introduces a benchmark artifact and validation methodology.

Findings

01

The SFT-attributable lift is roughly uniform across model sizes (+0.070 / +0.040 / +0.075 at 0.8B / 2B / 4B)

02

Variation in post-SFT Δ is dominated by a W-shaped pre-SFT base trajectory

03

SFT works hardest in absolute terms where the base model struggles with the procedure, a regime-asymmetric pattern

Abstract

We measure procedural-skill SFT contribution across three Qwen3.5 dense scales (0.8B, 2B, 4B) on a 200-task / 40-skill holdout, with Claude Haiku 4.5 as a frontier reference. The corpus is 353 rows of (task + procedural-skill block, Opus chain-of-thought, judge-pass) demonstrations. Main finding. Under matched-path LLM-only scoring, the SFT-attributable procedural-$\Delta$ lift is roughly uniform across sizes: $+0.070$ / $+0.040$ / $+0.075$ at 0.8B / 2B / 4B. Variation in post-SFT $\Delta$ ($-0.005$, $+0.100$, $+0.065$) is dominated by a W-shaped pre-SFT base trajectory ($-0.075$, $+0.060$, $-0.010$, Haiku 4.5 at $+0.030$): the 5-step procedure hurts 0.8B and 4B, helps 2B, and helps frontier Haiku modestly. SFT works hardest in absolute terms where the base struggles with the procedure -- a regime-asymmetric pattern with a falsifiable prediction at 8B/14B. Methodology. (i) A bench…
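The three sets of numbers in the abstract appear to be related by simple subtraction: the SFT-attributable lift at each size equals the post-SFT Δ minus the pre-SFT base Δ. The minimal sketch below checks that arithmetic using only the values quoted above. The decomposition itself is an assumption inferred from those numbers, and the names `pre_sft`, `post_sft`, and `sft_lift` are hypothetical, not taken from the paper.

```python
# Procedural-Δ values quoted in the abstract, keyed by model size.
pre_sft = {"0.8B": -0.075, "2B": 0.060, "4B": -0.010}   # base models, before SFT
post_sft = {"0.8B": -0.005, "2B": 0.100, "4B": 0.065}   # same models, after SFT


def sft_lift(pre, post):
    """Assumed decomposition: lift = post-SFT Δ minus pre-SFT Δ, per size."""
    return {size: round(post[size] - pre[size], 3) for size in pre}


lift = sft_lift(pre_sft, post_sft)

for size, value in lift.items():
    print(f"{size}: pre={pre_sft[size]:+.3f} post={post_sft[size]:+.3f} lift={value:+.3f}")
```

Under this reading, the two sizes whose base Δ is negative (0.8B and 4B) receive the largest lifts, which matches the regime-asymmetric pattern the abstract describes.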
