It's Not the Size: Harness Design Determines Operational Stability in Small Language Models

Yong-eun Cho

arXiv:2605.12129·cs.SE·May 13, 2026

TL;DR

Harness design strongly influences the operational stability of small language models (2-3B parameters): a 4-stage pipeline harness (plan, execute, verify, recover) reaches TSR=0.952 and VTSR=1.000 on Gemma4 E2B, whereas a raw prompt leaves LLaMA 3.2 3B at TSR=0.429.

Contribution

The paper introduces a 4-stage pipeline harness (plan, execute, verify, recover) for small language models and measures it against model-only and minimal-shell baselines, showing how the level of harness engineering affects task success and stability.

Findings

01

Pipeline harness achieves TSR=0.952 and VTSR=1.000 on Gemma4 E2B.

02

In two of the three models, the minimal-shell harness yields a lower TSR than model-only, so adding harness structure does not improve results monotonically.

03

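Without harness support, complex format requirements can cause scaffold collapse: LLaMA 3.2 3B in the model-only condition abandons JSON structure, with seven format violations and TSR=0.429.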
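
The findings report both TSR and VTSR, but this page does not give their exact definitions. The sketch below shows one plausible reading, assuming TSR counts outputs that are both format-valid and correct over all tasks, while VTSR counts correct outputs over format-valid outputs only. These definitions, and the names `TaskResult`, `tsr`, and `vtsr`, are assumptions made for illustration, not the paper's code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskResult:
    # Hypothetical per-task record; the paper's actual logging format is not shown.
    format_valid: bool  # output could be parsed in the required format
    correct: bool       # output also passed the task's correctness check


def tsr(results: List[TaskResult]) -> float:
    """Assumed Task Success Rate: valid-and-correct outputs over all tasks."""
    if not results:
        return 0.0
    successes = sum(r.format_valid and r.correct for r in results)
    return successes / len(results)


def vtsr(results: List[TaskResult]) -> float:
    """Assumed Valid TSR: correct outputs over format-valid outputs only."""
    valid = [r for r in results if r.format_valid]
    if not valid:
        return 0.0
    return sum(r.correct for r in valid) / len(valid)


# Hypothetical breakdown: 21 tasks, 20 succeed, 1 fails on format alone.
example = [TaskResult(True, True)] * 20 + [TaskResult(False, False)]
print(round(tsr(example), 3), round(vtsr(example), 3))
```

Under these assumed definitions, a run with 20 successes and one format failure out of 21 tasks gives TSR of about 0.952 and VTSR of 1.000, which is numerically consistent with the Gemma4 E2B figures. The actual per-task breakdown is not stated on this page.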

Abstract

This paper experimentally analyzes how the level of harness engineering affects the operational performance of small language models (SLMs, 2-3B parameters). Three harness conditions - model-only (raw prompt), minimal-shell (wrapper tags), and a 4-stage pipeline (plan->execute->verify->recover) - are applied to three models (Gemma4 E2B, Qwen3.5:2B, LLaMA 3.2 3B) across 24 tasks, comparing Task Success Rate (TSR) and Valid TSR (VTSR). The pipeline harness achieves TSR=0.952 and VTSR=1.000 on Gemma4 E2B (T1-T5, 21 tasks). A non-monotonic phenomenon - minimal-shell TSR < model-only TSR - is observed in two models. In LLaMA 3.2 3B model-only, seven format violations yield TSR=0.429, revealing scaffold collapse: the model abandons JSON structure under complex format requirements without harness support. Ablation shows planning and recovery each contribute approximately 24.7% of total gain.…
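
The plan->execute->verify->recover loop described in the abstract can be sketched as follows. This is a minimal illustration, not the author's implementation: the prompt wording, the `run_pipeline` and `verify` names, the `max_recoveries` parameter, and the choice of "parses as a JSON object" as the verification check are all assumptions. The model is passed in as any callable from prompt string to reply string.

```python
import json
from typing import Callable, Optional


def verify(raw: str) -> Optional[dict]:
    """Assumed verify stage: accept only replies that parse as a JSON object."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return obj if isinstance(obj, dict) else None


def run_pipeline(
    task: str,
    model: Callable[[str], str],
    max_recoveries: int = 2,  # hypothetical retry budget
) -> Optional[dict]:
    # Stage 1: plan. Ask the model for a short plan before it answers.
    plan = model(f"PLAN\nTask: {task}\nList the steps needed.")

    # Stage 2: execute. Answer the task with the plan in context.
    prompt = f"EXECUTE\nTask: {task}\nPlan: {plan}\nReply with JSON only."
    for _ in range(max_recoveries + 1):
        raw = model(prompt)

        # Stage 3: verify. Check the reply against the format requirement.
        parsed = verify(raw)
        if parsed is not None:
            return parsed

        # Stage 4: recover. Feed the invalid reply back and ask again.
        prompt = (
            f"RECOVER\nTask: {task}\n"
            f"Your previous reply was not a valid JSON object:\n{raw}\n"
            "Reply with JSON only."
        )
    return None  # counted as a format violation by the caller
```

A model-only condition would correspond to calling `model(task)` once and parsing whatever comes back, which is where a format violation goes unrepaired.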
