Disposition Distillation at Small Scale: A Three-Arc Negative Result
Hari Sadasivan (Tinman Lab)

TL;DR
This study tried to train behavioral dispositions into small language models using SFT/DPO LoRA distillation, inference-time attention-head tempering, and a frozen-base confidence-gated sidecar. No method improved disposition without harming content quality.
Contribution
It provides a comprehensive negative result showing the difficulty of modifying dispositions in small language models and introduces a taxonomy of failure modes.
Findings
Initially reported gains (+33.9-point MCAS, +15.3-point HumanEval) were falsified by a second-pass sanity check before publication.
No intervention improved dispositions without content damage.
Gemma 4 E2B shows confidence-correctness decoupling.
Abstract
We set out to train behavioral dispositions (self-verification, uncertainty acknowledgment, feedback integration) into small language models (0.6B to 2.3B effective parameters) through a four-stage all-MIT distillation pipeline, with follow-on experiments on inference-time attention-head interventions and a frozen-base confidence-gated sidecar. An internal draft reported +33.9-point MCAS and +15.3-point HumanEval gains on a Qwen3-0.6B student; a second-pass sanity check falsified both numbers before publication. The HumanEval delta was a truncation artifact (n_predict=512) that inverted to -8.0 points at n_predict=1024; the MCAS gain disappeared under apples-to-apples scoring. That falsification triggered three subsequent arcs. Across (1) SFT/DPO LoRA on three model families and two domains, (2) inference-time attention-head tempering on o_proj, and (3) a training-free frozen-base…
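The truncation artifact mentioned in the abstract can be illustrated with a toy sketch. Everything here is hypothetical: the `Completion` record, the `pass_rate` and `delta` helpers, and the four-problem data are invented for illustration and are not the paper's evaluation harness or results. Only the two `n_predict` budgets (512 and 1024) come from the abstract. The sketch assumes a completion that is cut off before it finishes counts as a failure, so a model that writes longer answers is penalized at the smaller budget.

```python
from dataclasses import dataclass


@dataclass
class Completion:
    tokens_needed: int         # tokens the model needs to finish its answer
    correct_if_complete: bool  # whether the full, untruncated answer passes


def pass_rate(completions, n_predict):
    """Percent of problems passed when generation stops at n_predict tokens.

    Assumption: an answer truncated before completion is scored as a failure.
    """
    passed = sum(
        1
        for c in completions
        if c.tokens_needed <= n_predict and c.correct_if_complete
    )
    return 100.0 * passed / len(completions)


# Invented toy data: the base model is often right but verbose,
# the student is terse but right less often.
base = [
    Completion(700, True),
    Completion(900, True),
    Completion(300, True),
    Completion(400, False),
]
student = [
    Completion(350, True),
    Completion(450, True),
    Completion(300, False),
    Completion(400, False),
]


def delta(n_predict):
    """Student minus base pass rate, in percentage points, at a given budget."""
    return pass_rate(student, n_predict) - pass_rate(base, n_predict)


for budget in (512, 1024):
    print(f"n_predict={budget}: delta = {delta(budget):+.1f} points")
```

In this toy setup the student appears to win at the smaller budget only because the base model's longer correct answers are cut off; once the budget is large enough for both models to finish, the sign of the delta flips. Comparing scores at more than one generation budget is one simple way to check for this kind of artifact.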
