It's Not the Size: Harness Design Determines Operational Stability in Small Language Models
Yong-eun Cho

TL;DR
This study shows that harness design, not model size alone, shapes the operational stability of small language models: a 4-stage pipeline harness (plan, execute, verify, recover) reaches TSR=0.952 on Gemma4 E2B, whereas raw prompts can collapse under complex format requirements.
Contribution
The paper applies a 4-stage pipeline harness (plan, execute, verify, recover) to 2-3B-parameter language models, compares it against model-only and minimal-shell conditions, and uses ablation to attribute the gain to individual stages.
Findings
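The pipeline harness achieves TSR=0.952 and VTSR=1.000 on Gemma4 E2B (T1-T5, 21 tasks).
In two of the three models, the minimal-shell harness yields a lower TSR than model-only, so adding harness structure does not improve results monotonically.
Without harness support, complex format requirements can cause scaffold collapse: LLaMA 3.2 3B in the model-only condition abandons JSON structure, and seven format violations yield TSR=0.429.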
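The two metrics in these findings can be illustrated with a small sketch. The summary does not define them, so the code assumes that TSR (Task Success Rate) is successes divided by all attempted tasks and that VTSR (Valid TSR) is successes divided by only those tasks whose output was format-valid. The `TaskResult` type and the per-task breakdown are hypothetical; the breakdown is one split of 21 tasks that happens to reproduce the reported pair, not data from the paper.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    format_valid: bool  # assumption: output parsed in the required format
    success: bool       # assumption: task judged successful


def tsr(results: list[TaskResult]) -> float:
    """Assumed definition: successes over all attempted tasks."""
    return sum(r.success for r in results) / len(results)


def vtsr(results: list[TaskResult]) -> float:
    """Assumed definition: successes over format-valid tasks only."""
    valid = [r for r in results if r.format_valid]
    if not valid:
        return 0.0
    return sum(r.success for r in valid) / len(valid)


# Hypothetical breakdown: 20 valid successes plus 1 format-invalid failure.
results = [TaskResult(True, True)] * 20 + [TaskResult(False, False)]

print(f"TSR={tsr(results):.3f} VTSR={vtsr(results):.3f}")
```

Under these assumed definitions, a gap between TSR and VTSR isolates failures caused by output format rather than by task reasoning.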
Abstract
This paper experimentally analyzes how the level of harness engineering affects the operational performance of small language models (SLMs, 2-3B parameters). Three harness conditions - model-only (raw prompt), minimal-shell (wrapper tags), and a 4-stage pipeline (plan->execute->verify->recover) - are applied to three models (Gemma4 E2B, Qwen3.5:2B, LLaMA 3.2 3B) across 24 tasks, comparing Task Success Rate (TSR) and Valid TSR (VTSR). The pipeline harness achieves TSR=0.952 and VTSR=1.000 on Gemma4 E2B (T1-T5, 21 tasks). A non-monotonic phenomenon - minimal-shell TSR < model-only TSR - is observed in two models. In LLaMA 3.2 3B model-only, seven format violations yield TSR=0.429, revealing scaffold collapse: the model abandons JSON structure under complex format requirements without harness support. Ablation shows planning and recovery each contribute approximately 24.7% of total gain.…
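The 4-stage pipeline named in the abstract (plan, execute, verify, recover) could be wired together roughly as follows. This is a minimal sketch under stated assumptions, not the paper's implementation: the prompt wording, the `run_pipeline` and `verify` helpers, the required `answer` key, and the retry limit are all hypothetical, and `fake_model` is a stub standing in for a real SLM call.

```python
import json
from typing import Callable, Optional


def verify(output: str) -> Optional[str]:
    """Return None if the output is valid, else a description of the problem."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError as exc:
        return f"invalid JSON: {exc.msg}"
    if not isinstance(data, dict) or "answer" not in data:
        return "missing 'answer' key"  # hypothetical schema requirement
    return None


def run_pipeline(task: str, model: Callable[[str], str],
                 max_recoveries: int = 2) -> dict:
    # Stage 1: plan. Ask the model to outline steps before answering.
    plan = model(f"PLAN\nTask: {task}\nList the steps needed.")
    # Stage 2: execute. Produce the answer in the required format.
    output = model(f"EXECUTE\nTask: {task}\nPlan: {plan}\n"
                   "Answer as JSON with key 'answer'.")
    for attempt in range(max_recoveries + 1):
        # Stage 3: verify. Check the format outside the model.
        error = verify(output)
        if error is None:
            return {"ok": True, "result": json.loads(output),
                    "recoveries": attempt}
        if attempt == max_recoveries:
            break
        # Stage 4: recover. Feed the detected problem back to the model.
        output = model(f"RECOVER\nTask: {task}\nPrevious output: {output}\n"
                       f"Problem: {error}\nReturn only valid JSON.")
    return {"ok": False, "result": None, "recoveries": max_recoveries}


def fake_model(prompt: str) -> str:
    """Stub model: violates the format once, then complies on recovery."""
    if prompt.startswith("PLAN"):
        return "1. add the numbers"
    if prompt.startswith("EXECUTE"):
        return "The answer is 4"  # prose instead of JSON
    return '{"answer": 4}'


result = run_pipeline("What is 2 + 2?", fake_model)
print(result)
```

In this sketch the model-only condition would correspond to calling `model` once with the raw task. The stubbed run shows a format violation at the execute stage being caught by verification and repaired by a single recovery call.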
